2025-05-07T20:23:13.3036442Z Current runner version: '2.323.0'
2025-05-07T20:23:13.3042098Z Runner name: 'i-0e56304501e4f5200'
2025-05-07T20:23:13.3043019Z Machine name: 'ip-10-0-66-0'
2025-05-07T20:23:13.3045748Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:13.3048258Z Contents: read
2025-05-07T20:23:13.3048785Z Metadata: read
2025-05-07T20:23:13.3049283Z Packages: read
2025-05-07T20:23:13.3049776Z ##[endgroup]
2025-05-07T20:23:13.3052023Z Secret source: None
2025-05-07T20:23:13.3052672Z Prepare workflow directory
2025-05-07T20:23:13.3998352Z Prepare all required actions
2025-05-07T20:23:13.4040080Z Getting action download info
2025-05-07T20:23:13.5908992Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:13.8870304Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:14.3031624Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:16.0510324Z Getting action download info
2025-05-07T20:23:16.1757315Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:16.3754906Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.11, 12.8.0, 12.6.3, clang)
2025-05-07T20:23:16.4268534Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:16.4378251Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:16.4389804Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:16.4390444Z ##[endgroup]
2025-05-07T20:23:17.4543401Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:17.4543849Z Instance Type: g5.4xlarge
2025-05-07T20:23:17.4544098Z AMI Name: unknown
2025-05-07T20:23:17.4584494Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:22.8518324Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:22.8518627Z with:
2025-05-07T20:23:22.8518877Z   submodules: true
2025-05-07T20:23:22.8519110Z   repository: pytorch/FBGEMM
2025-05-07T20:23:22.8519490Z   token: ***
2025-05-07T20:23:22.8519691Z   ssh-strict: true
2025-05-07T20:23:22.8519907Z   ssh-user: git
2025-05-07T20:23:22.8520125Z   persist-credentials: true
2025-05-07T20:23:22.8520374Z   clean: true
2025-05-07T20:23:22.8520605Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:22.8520874Z   fetch-depth: 1
2025-05-07T20:23:22.8521087Z   fetch-tags: false
2025-05-07T20:23:22.8521310Z   show-progress: true
2025-05-07T20:23:22.8521530Z   lfs: false
2025-05-07T20:23:22.8521734Z   set-safe-directory: true
2025-05-07T20:23:22.8521992Z env:
2025-05-07T20:23:22.8522201Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:22.8522514Z   BUILD_ENV: build_binary
2025-05-07T20:23:22.8522767Z   BUILD_TARGET: genai
2025-05-07T20:23:22.8523000Z   BUILD_VARIANT: cuda
2025-05-07T20:23:22.8523252Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:22.8523496Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:22.8523729Z ##[endgroup]
2025-05-07T20:23:22.9665761Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:22.9667292Z ##[group]Getting Git version info
2025-05-07T20:23:22.9667866Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:22.9668672Z [command]/usr/bin/git version
2025-05-07T20:23:22.9668992Z git version 2.47.1
2025-05-07T20:23:22.9684811Z ##[endgroup]
2025-05-07T20:23:22.9695164Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/cd033a63-f207-416f-848b-cd9b9c59e344/.gitconfig'
2025-05-07T20:23:22.9703861Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/cd033a63-f207-416f-848b-cd9b9c59e344' before making global git config changes
2025-05-07T20:23:22.9704874Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:22.9718713Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:22.9763958Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:22.9787535Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:22.9805691Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:22.9809310Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:22.9834611Z refs/heads/main
2025-05-07T20:23:22.9843747Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:23.8554151Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:23.8605850Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:23.8636800Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:23.8642463Z ##[endgroup]
2025-05-07T20:23:23.8645500Z [command]/usr/bin/git submodule status
2025-05-07T20:23:23.9066446Z e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:23.9150765Z 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:23.9239389Z 6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:23.9325586Z 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:23.9416447Z f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:23.9506617Z 420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:23.9588107Z 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:23.9601043Z ##[group]Cleaning the repository
2025-05-07T20:23:23.9605735Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:23.9664296Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:23.9775844Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:23.9783188Z ##[endgroup]
2025-05-07T20:23:23.9785000Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:23.9789380Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:23.9821202Z ##[endgroup]
2025-05-07T20:23:23.9821737Z ##[group]Setting up auth
2025-05-07T20:23:23.9826513Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:23.9870138Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:24.0201139Z Entering 'external/asmjit'
2025-05-07T20:23:24.0267991Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.0340873Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.0407200Z Entering 'external/cutlass'
2025-05-07T20:23:24.0483848Z Entering 'external/googletest'
2025-05-07T20:23:24.0550736Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.0616218Z Entering 'external/json'
2025-05-07T20:23:24.0703323Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:24.0736447Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:24.1070946Z Entering 'external/asmjit'
2025-05-07T20:23:24.1136347Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.1209082Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.1276589Z Entering 'external/cutlass'
2025-05-07T20:23:24.1352165Z Entering 'external/googletest'
2025-05-07T20:23:24.1418041Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.1486717Z Entering 'external/json'
2025-05-07T20:23:24.1572934Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:24.1624960Z ##[endgroup]
2025-05-07T20:23:24.1625536Z ##[group]Fetching the repository
2025-05-07T20:23:24.1632533Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:24.3561056Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:24.3561934Z  * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:24.3587124Z ##[endgroup]
2025-05-07T20:23:24.3587771Z ##[group]Determining the checkout info
2025-05-07T20:23:24.3588713Z ##[endgroup]
2025-05-07T20:23:24.3593047Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:24.3644138Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:24.3672941Z ##[group]Checking out the ref
2025-05-07T20:23:24.3676278Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:24.3802887Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:24.3806053Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:24.3815875Z ##[endgroup]
2025-05-07T20:23:24.3816402Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:24.3821394Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:24.3873051Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:24.3903721Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:24.3936429Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:24.3965855Z ##[endgroup]
2025-05-07T20:23:24.3966727Z ##[group]Fetching submodules
2025-05-07T20:23:24.3970398Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:24.4351870Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:24.4352379Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:24.4353147Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:24.4353691Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:24.4354174Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:24.4354653Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:24.4355174Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:24.4368168Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:24.4799524Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:24.4951469Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:24.5054868Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:24.5222574Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:24.5312676Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:24.5394852Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:24.5495865Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:24.5512713Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:24.5842974Z Entering 'external/asmjit'
2025-05-07T20:23:24.5873797Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.5906733Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.5939033Z Entering 'external/cutlass'
2025-05-07T20:23:24.5971136Z Entering 'external/googletest'
2025-05-07T20:23:24.6002527Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.6035598Z Entering 'external/json'
2025-05-07T20:23:24.6080207Z ##[endgroup]
2025-05-07T20:23:24.6081101Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:24.6087100Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:24.6418255Z Entering 'external/asmjit'
2025-05-07T20:23:24.6459398Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6459811Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6502319Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.6545295Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6545752Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6594520Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.6638551Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6638898Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6681709Z Entering 'external/cutlass'
2025-05-07T20:23:24.6724073Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6724411Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6775543Z Entering 'external/googletest'
2025-05-07T20:23:24.6817771Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6818127Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6861043Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.6903600Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6903992Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6952119Z Entering 'external/json'
2025-05-07T20:23:24.6994565Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6995012Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7056138Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:24.7385152Z Entering 'external/asmjit'
2025-05-07T20:23:24.7451935Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:24.7454449Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.7515591Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:24.7518483Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.7580437Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:24.7583573Z Entering 'external/cutlass'
2025-05-07T20:23:24.7644798Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:24.7648521Z Entering 'external/googletest'
2025-05-07T20:23:24.7710201Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:24.7713319Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.7774950Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:24.7777929Z Entering 'external/json'
2025-05-07T20:23:24.7843891Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:24.7967531Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:24.8300924Z Entering 'external/asmjit'
2025-05-07T20:23:24.8333007Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.8366209Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.8401802Z Entering 'external/cutlass'
2025-05-07T20:23:24.8434698Z Entering 'external/googletest'
2025-05-07T20:23:24.8467303Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.8501473Z Entering 'external/json'
2025-05-07T20:23:24.8548335Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:24.8881129Z Entering 'external/asmjit'
2025-05-07T20:23:24.8913415Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.8947390Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.8979148Z Entering 'external/cutlass'
2025-05-07T20:23:24.9011521Z Entering 'external/googletest'
2025-05-07T20:23:24.9043702Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.9074991Z Entering 'external/json'
2025-05-07T20:23:24.9121653Z ##[endgroup]
2025-05-07T20:23:24.9163783Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:24.9190384Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:24.9381299Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:23:24.9381639Z with:
2025-05-07T20:23:24.9381893Z   name: fbgemm_genai_x86_clang_py3.11_cu12.8.0.whl
2025-05-07T20:23:24.9382249Z   merge-multiple: false
2025-05-07T20:23:24.9382522Z   repository: pytorch/FBGEMM
2025-05-07T20:23:24.9382800Z   run-id: 14891846252
2025-05-07T20:23:24.9383023Z env:
2025-05-07T20:23:24.9383253Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:24.9383577Z   BUILD_ENV: build_binary
2025-05-07T20:23:24.9383842Z   BUILD_TARGET: genai
2025-05-07T20:23:24.9384081Z   BUILD_VARIANT: cuda
2025-05-07T20:23:24.9384334Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:24.9384602Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:24.9384857Z ##[endgroup]
2025-05-07T20:23:25.1709420Z Downloading single artifact
2025-05-07T20:23:25.2633067Z Preparing to download the following artifacts:
2025-05-07T20:23:25.2634215Z - fbgemm_genai_x86_clang_py3.11_cu12.8.0.whl (ID: 3081407693, Size: 18493360, Expected Digest: sha256:712e5982f3c27e6bb70c4c07f6076ab85e5daa73adc8fdd928558f49c8845247)
2025-05-07T20:23:25.3174766Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-0c78ae5c-d1af-5cac-9cef-71d15264925f/artifacts/26da78488c24807c90bb678b8b7579283275a81cc21beba82a9498d4848351d8.zip
2025-05-07T20:23:25.3176167Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:25.4401003Z (node:208300) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:25.4402063Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:25.7347652Z SHA256 digest of downloaded artifact is 712e5982f3c27e6bb70c4c07f6076ab85e5daa73adc8fdd928558f49c8845247
2025-05-07T20:23:25.7348419Z Artifact download completed successfully.
2025-05-07T20:23:25.7348753Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:25.7354205Z Download artifact has finished successfully
2025-05-07T20:23:25.7610337Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:25.7610732Z with:
2025-05-07T20:23:25.7610952Z   driver-version: 570.133.07
2025-05-07T20:23:25.7611197Z env:
2025-05-07T20:23:25.7611424Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:25.7611737Z   BUILD_ENV: build_binary
2025-05-07T20:23:25.7611986Z   BUILD_TARGET: genai
2025-05-07T20:23:25.7612213Z   BUILD_VARIANT: cuda
2025-05-07T20:23:25.7612453Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:25.7612718Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:25.7612950Z ##[endgroup]
2025-05-07T20:23:25.7708851Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:25.7709338Z with:
2025-05-07T20:23:25.7709543Z   timeout_minutes: 10
2025-05-07T20:23:25.7709774Z   max_attempts: 3
2025-05-07T20:23:25.7732984Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-docker2 package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists; check its version next. Also check only the first GPU if there is
          # more than one, so that the same driver version is not printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install NVIDIA driver; try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, the nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed, as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead, as it is guaranteed to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info, like the
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?

          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:23:25.7755818Z   retry_wait_seconds: 10
2025-05-07T20:23:25.7756080Z   polling_interval_seconds: 1
2025-05-07T20:23:25.7756339Z   warning_on_retry: true
2025-05-07T20:23:25.7775723Z   continue_on_error: false
2025-05-07T20:23:25.7776005Z env:
2025-05-07T20:23:25.7776222Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:25.7776559Z   BUILD_ENV: build_binary
2025-05-07T20:23:25.7776796Z   BUILD_TARGET: genai
2025-05-07T20:23:25.7777014Z   BUILD_VARIANT: cuda
2025-05-07T20:23:25.7777254Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:25.7777503Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:25.7777738Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:25.7777990Z ##[endgroup]
2025-05-07T20:23:26.6631196Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:26.6631919Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:26.6634792Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:26.9881261Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:26.9881645Z No packages marked for removal.
2025-05-07T20:23:26.9949066Z Dependencies resolved.
2025-05-07T20:23:26.9959012Z Nothing to do.
2025-05-07T20:23:26.9959667Z Complete!
2025-05-07T20:23:27.0885324Z + install_nvidia_driver_common
2025-05-07T20:23:27.0889519Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:27.0889857Z + lspci
2025-05-07T20:23:27.0890580Z Before installing NVIDIA driver
2025-05-07T20:23:27.1008984Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:27.1010571Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:27.1012110Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:27.1013599Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:27.1014564Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:27.1015507Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:27.1016167Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:27.1016644Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:27.1017048Z + lsmod
2025-05-07T20:23:27.1060200Z Module                  Size  Used by
2025-05-07T20:23:27.1060905Z veth                   36864  0
2025-05-07T20:23:27.1061671Z nvidia_modeset       1716224  0
2025-05-07T20:23:27.1062512Z video                  65536  1 nvidia_modeset
2025-05-07T20:23:27.1063447Z wmi                    36864  1 video
2025-05-07T20:23:27.1064193Z nvidia_uvm           1884160  0
2025-05-07T20:23:27.1064773Z nvidia              11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:27.1065415Z drm                   602112  1 nvidia
2025-05-07T20:23:27.1065981Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:27.1066330Z backlight              24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:27.1066677Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:27.1066968Z xt_conntrack           16384  1
2025-05-07T20:23:27.1067221Z nft_chain_nat          16384  3
2025-05-07T20:23:27.1067480Z xt_MASQUERADE          20480  1
2025-05-07T20:23:27.1067777Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:27.1068654Z nf_conntrack_netlink    57344  0
2025-05-07T20:23:27.1069093Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:27.1069525Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:27.1069833Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:27.1070114Z xfrm_user              57344  1
2025-05-07T20:23:27.1070377Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:27.1070662Z xt_addrtype            16384  2
2025-05-07T20:23:27.1070911Z nft_compat             20480  4
2025-05-07T20:23:27.1071214Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:27.1071617Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:27.1071975Z br_netfilter           36864  0
2025-05-07T20:23:27.1072258Z bridge                323584  1 br_netfilter
2025-05-07T20:23:27.1072553Z stp                    16384  1 bridge
2025-05-07T20:23:27.1072845Z llc                    16384  2 bridge,stp
2025-05-07T20:23:27.1073116Z overlay               167936  0
2025-05-07T20:23:27.1073366Z tls                   135168  0
2025-05-07T20:23:27.1073619Z nls_ascii              16384  1
2025-05-07T20:23:27.1073863Z nls_cp437              20480  1
2025-05-07T20:23:27.1074105Z vfat                   24576  1
2025-05-07T20:23:27.1074353Z fat                    86016  1 vfat
2025-05-07T20:23:27.1074612Z sunrpc                696320  1
2025-05-07T20:23:27.1074857Z ena                   180224  0
2025-05-07T20:23:27.1075099Z i8042                  45056  0
2025-05-07T20:23:27.1075343Z serio                  28672  3 i8042
2025-05-07T20:23:27.1075617Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:27.1075878Z button                 24576  0
2025-05-07T20:23:27.1076130Z sch_fq_codel           20480  17
2025-05-07T20:23:27.1076383Z dm_mod                188416  0
2025-05-07T20:23:27.1076626Z fuse                  163840  1
2025-05-07T20:23:27.1076870Z loop                   36864  0
2025-05-07T20:23:27.1077115Z configfs               57344  1
2025-05-07T20:23:27.1077366Z dax                    45056  1 dm_mod
2025-05-07T20:23:27.1077637Z dmi_sysfs              20480  0
2025-05-07T20:23:27.1077879Z crc32_pclmul           16384  0
2025-05-07T20:23:27.1078274Z crc32c_intel           24576  0
2025-05-07T20:23:27.1078525Z efivarfs               24576  1
2025-05-07T20:23:27.1078776Z + modinfo nvidia
2025-05-07T20:23:27.1079586Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:27.1080116Z import_ns:      DMA_BUF
2025-05-07T20:23:27.1080370Z alias:          char-major-195-*
2025-05-07T20:23:27.1080644Z version:        570.133.07
2025-05-07T20:23:27.1080894Z supported:      external
2025-05-07T20:23:27.1081141Z license:        Dual MIT/GPL
2025-05-07T20:23:27.1081431Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:27.1081774Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:27.1082107Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:27.1082421Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:27.1082760Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:27.1083092Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:27.1083402Z depends:        i2c-core,drm
2025-05-07T20:23:27.1083669Z retpoline:      Y
2025-05-07T20:23:27.1083893Z name:           nvidia
2025-05-07T20:23:27.1084259Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:27.1084727Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:27.1085284Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:27.1085961Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:27.1086449Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:27.1086922Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:27.1087439Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:27.1088041Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:27.1088435Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:27.1088904Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:27.1089484Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:27.1090005Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:27.1090387Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:27.1090688Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:27.1091053Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:27.1091449Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:27.1091826Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:27.1092228Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.1092636Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:27.1093052Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.1093462Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:27.1093798Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:27.1094163Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:27.1094530Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:27.1094865Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:27.1095184Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:27.1095513Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:27.1095829Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:27.1096146Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:27.1096487Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:27.1096840Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:27.1097167Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:27.1097498Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:27.1097831Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:27.1098170Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:27.1098507Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:27.1098837Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:27.1099257Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:27.1099586Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:27.1099925Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:27.1100233Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:27.1100565Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:27.1100927Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:27.1101270Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:27.1101598Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:27.1101945Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:27.1102285Z parm:           rm_firmware_active:charp
2025-05-07T20:23:27.1102587Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:27.1102838Z ++ command -v nvidia-smi
2025-05-07T20:23:27.1103101Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:27.1103355Z + set +e
2025-05-07T20:23:27.1103667Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:27.1320370Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:27.1320666Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:27.1320900Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:27.1321684Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:27.1321954Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:27.1323131Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:27.1323931Z + set -e
2025-05-07T20:23:27.1324139Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:27.1324525Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:27.1324988Z + post_install_nvidia_driver_common
2025-05-07T20:23:27.1327545Z + sudo modprobe nvidia
2025-05-07T20:23:27.2635115Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:27.2635479Z + lspci
2025-05-07T20:23:27.2635704Z After installing NVIDIA driver
2025-05-07T20:23:27.2751532Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:27.2752060Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:27.2752597Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:27.2753192Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:27.2753858Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:27.2754610Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:27.2755096Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:27.2755561Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:27.2755956Z + lsmod
2025-05-07T20:23:27.2784713Z Module                  Size  Used by
2025-05-07T20:23:27.2784999Z veth                   36864  0
2025-05-07T20:23:27.2785262Z nvidia_modeset       1716224  0
2025-05-07T20:23:27.2785541Z video                  65536  1 nvidia_modeset
2025-05-07T20:23:27.2786046Z wmi                    36864  1 video
2025-05-07T20:23:27.2786570Z nvidia_uvm           1884160  0
2025-05-07T20:23:27.2787150Z nvidia              11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:27.2787790Z drm                   602112  1 nvidia
2025-05-07T20:23:27.2788371Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:27.2789186Z backlight              24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:27.2789864Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:27.2790419Z xt_conntrack           16384  1
2025-05-07T20:23:27.2790927Z nft_chain_nat          16384  3
2025-05-07T20:23:27.2791436Z xt_MASQUERADE          20480  1
2025-05-07T20:23:27.2792015Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:27.2792661Z nf_conntrack_netlink    57344  0
2025-05-07T20:23:27.2793438Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:27.2794286Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:27.2795308Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:27.2795892Z xfrm_user              57344  1
2025-05-07T20:23:27.2796213Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:27.2796491Z xt_addrtype            16384  2
2025-05-07T20:23:27.2796757Z nft_compat             20480  4
2025-05-07T20:23:27.2797063Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:27.2797473Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:27.2797836Z br_netfilter           36864  0
2025-05-07T20:23:27.2798112Z bridge                323584  1 br_netfilter
2025-05-07T20:23:27.2798405Z stp                    16384  1 bridge
2025-05-07T20:23:27.2798693Z llc                    16384  2 bridge,stp
2025-05-07T20:23:27.2798978Z overlay               167936  0
2025-05-07T20:23:27.2799227Z tls                   135168  0
2025-05-07T20:23:27.2799470Z nls_ascii              16384  1
2025-05-07T20:23:27.2799728Z nls_cp437              20480  1
2025-05-07T20:23:27.2799984Z vfat                   24576  1
2025-05-07T20:23:27.2800229Z fat                    86016  1 vfat
2025-05-07T20:23:27.2800495Z sunrpc                696320  1
2025-05-07T20:23:27.2800742Z ena                   180224  0
2025-05-07T20:23:27.2800976Z i8042                  45056  0
2025-05-07T20:23:27.2801227Z serio                  28672  3 i8042
2025-05-07T20:23:27.2801501Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:27.2801756Z button                 24576  0
2025-05-07T20:23:27.2802002Z sch_fq_codel           20480  17
2025-05-07T20:23:27.2802258Z dm_mod                188416  0
2025-05-07T20:23:27.2802503Z fuse                  163840  1
2025-05-07T20:23:27.2802741Z loop                   36864  0
2025-05-07T20:23:27.2803153Z configfs               57344  1
2025-05-07T20:23:27.2803404Z dax                    45056  1 dm_mod
2025-05-07T20:23:27.2803669Z dmi_sysfs              20480  0
2025-05-07T20:23:27.2803921Z crc32_pclmul           16384  0
2025-05-07T20:23:27.2804181Z crc32c_intel           24576  0
2025-05-07T20:23:27.2804427Z efivarfs               24576  1
2025-05-07T20:23:27.2804677Z + modinfo nvidia
2025-05-07T20:23:27.2807932Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:27.2808388Z import_ns:      DMA_BUF
2025-05-07T20:23:27.2808634Z alias:          char-major-195-*
2025-05-07T20:23:27.2808901Z version:        570.133.07
2025-05-07T20:23:27.2809144Z supported:      external
2025-05-07T20:23:27.2809388Z license:        Dual MIT/GPL
2025-05-07T20:23:27.2809672Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:27.2810007Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:27.2810325Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:27.2810640Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:27.2810976Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:27.2811300Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:27.2811607Z depends:        i2c-core,drm
2025-05-07T20:23:27.2811864Z retpoline:      Y
2025-05-07T20:23:27.2812083Z name:           nvidia
2025-05-07T20:23:27.2812435Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:27.2812900Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:27.2813333Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:27.2813747Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:27.2814050Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:27.2814356Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:27.2814668Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:27.2814969Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:27.2815274Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:27.2815635Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:27.2816012Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:27.2816452Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:27.2816759Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:27.2817057Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:27.2817424Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:27.2817822Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:27.2818196Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:27.2818595Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.2818998Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:27.2819413Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.2819814Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:27.2820150Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:27.2820512Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:27.2820875Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:27.2821208Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:27.2821525Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:27.2821852Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:27.2822164Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:27.2822474Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:27.2822817Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:27.2823165Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:27.2823492Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:27.2823827Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:27.2824159Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:27.2824591Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:27.2824929Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:27.2825261Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:27.2825546Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:27.2825871Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:27.2826190Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:27.2826497Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:27.2826829Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:27.2827179Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:27.2827518Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:27.2827840Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:27.2828365Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:27.2828701Z parm:           rm_firmware_active:charp
2025-05-07T20:23:27.2828987Z + set +e
2025-05-07T20:23:27.2829241Z + nvidia-smi
2025-05-07T20:23:27.2984136Z Wed May  7 20:23:27 2025
2025-05-07T20:23:27.2984506Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:27.2985015Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:27.2985505Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:27.2986052Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:27.2986579Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:27.2987013Z |                                         |                        |               MIG M. |
2025-05-07T20:23:27.2987348Z |=========================================+========================+======================|
2025-05-07T20:23:27.3119323Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:27.3119791Z |  0%   28C    P8             22W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:27.3120185Z |                                         |                        |                  N/A |
2025-05-07T20:23:27.3120756Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:27.3124079Z
2025-05-07T20:23:27.3124537Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:27.3125034Z | Processes:                                                                              |
2025-05-07T20:23:27.3125539Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:27.3126053Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:27.3126476Z |=========================================================================================|
2025-05-07T20:23:27.3130878Z |  No running processes found                                                             |
2025-05-07T20:23:27.3131363Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:27.5708978Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:27.5877319Z NVIDIA A10G
2025-05-07T20:23:27.5920020Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:27.5921661Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:27.5922116Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:27.5922423Z + set -e
2025-05-07T20:23:27.5922628Z INFO: Ignoring allowed status 0
2025-05-07T20:23:27.5930726Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:27.5943778Z + sudo yum install -y yum-utils
2025-05-07T20:23:28.0299734Z Last metadata expiration check: 0:10:02 ago on Wed May  7 20:13:26 2025.
2025-05-07T20:23:28.0548711Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:28.0951977Z Dependencies resolved.
2025-05-07T20:23:28.1136534Z Nothing to do.
2025-05-07T20:23:28.1136900Z Complete!
2025-05-07T20:23:28.1541228Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:28.1541867Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:28.1542712Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:28.5739873Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:28.6318724Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:29.1621790Z nvidia-container-toolkit                         13 kB/s | 833  B     00:00
2025-05-07T20:23:29.1869263Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:29.1874789Z Package nvidia-container-toolkit-1.16.2-1.x86_64 is already installed.
2025-05-07T20:23:29.2269241Z Dependencies resolved.
2025-05-07T20:23:29.2450943Z Nothing to do.
2025-05-07T20:23:29.2451386Z Complete!
2025-05-07T20:23:29.2855044Z + sudo systemctl restart docker
2025-05-07T20:23:31.6697083Z nvidia-persistenced failed to initialize. Check syslog for more details.
2025-05-07T20:23:31.6893459Z Wed May  7 20:23:31 2025
2025-05-07T20:23:31.6894172Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:31.6895076Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:31.6895961Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:31.6896856Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:31.6897797Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:31.6898373Z |                                         |                        |               MIG M. |
2025-05-07T20:23:31.6898724Z |=========================================+========================+======================|
2025-05-07T20:23:31.7030002Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:31.7030450Z |  0%   29C    P8             22W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:31.7030819Z |                                         |                        |                  N/A |
2025-05-07T20:23:31.7031210Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:31.7034338Z
2025-05-07T20:23:31.7034735Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:31.7035162Z | Processes:                                                                              |
2025-05-07T20:23:31.7035601Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:31.7036027Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:31.7036370Z |=========================================================================================|
2025-05-07T20:23:31.7040891Z |  No running processes found                                                             |
2025-05-07T20:23:31.7041363Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:32.8278948Z Command completed after 1 attempt(s).
2025-05-07T20:23:32.8366963Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:32.8367441Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:32.8380633Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:32.8381175Z env:
2025-05-07T20:23:32.8381408Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:32.8381710Z   BUILD_ENV: build_binary
2025-05-07T20:23:32.8381962Z   BUILD_TARGET: genai
2025-05-07T20:23:32.8382204Z   BUILD_VARIANT: cuda
2025-05-07T20:23:32.8382437Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:32.8382694Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:32.8382995Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:32.8383322Z ##[endgroup]
2025-05-07T20:23:33.1766328Z ################################################################################
2025-05-07T20:23:33.1766708Z # Print System Info
2025-05-07T20:23:33.1766930Z #
2025-05-07T20:23:33.1782630Z # [2025-05-07T20:23:33.177Z] + print_system_info
2025-05-07T20:23:33.1783001Z ################################################################################
2025-05-07T20:23:33.1783219Z
2025-05-07T20:23:33.1783337Z ################################################################################
2025-05-07T20:23:33.1783673Z [INFO] Printing environment variables ...
2025-05-07T20:23:33.1783982Z + printenv
2025-05-07T20:23:33.1784098Z
2025-05-07T20:23:33.1794419Z SHELL=/bin/bash
2025-05-07T20:23:33.1794941Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:33.1795497Z BUILD_VARIANT=cuda
2025-05-07T20:23:33.1796219Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_059f0104-fe17-4e08-a0e5-9395de160e8b
2025-05-07T20:23:33.1797010Z GITHUB_ACTION=__run
2025-05-07T20:23:33.1797394Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:33.1797851Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:33.1798167Z RUNNER_NAME=i-0e56304501e4f5200
2025-05-07T20:23:33.1798445Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:33.1798749Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:33.1799009Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:33.1799363Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:33.1799779Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:33.1800062Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:33.1800339Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:33.1801044Z ***
2025-05-07T20:23:33.1801239Z LOGNAME=ec2-user
2025-05-07T20:23:33.1801475Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:33.1801725Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:33.1801951Z GITHUB_ACTIONS=true
2025-05-07T20:23:33.1802170Z SYSTEMD_EXEC_PID=55534
2025-05-07T20:23:33.1802436Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:33.1802969Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:33.1803469Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:33.1803750Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:33.1803999Z RUNNER_OS=Linux
2025-05-07T20:23:33.1804220Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:33.1804465Z HOME=/home/ec2-user
2025-05-07T20:23:33.1804706Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:33.1804990Z LANG=C.UTF-8
2025-05-07T20:23:33.1805288Z RUNNER_TRACKING_ID=github_bf3ce286-0e7f-4ee1-994d-9126ade0d35d
2025-05-07T20:23:33.1805635Z RUNNER_ARCH=X64
2025-05-07T20:23:33.1805907Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:33.1806234Z BUILD_TARGET=genai
2025-05-07T20:23:33.1806743Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_059f0104-fe17-4e08-a0e5-9395de160e8b
2025-05-07T20:23:33.1807584Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_059f0104-fe17-4e08-a0e5-9395de160e8b
2025-05-07T20:23:33.1808303Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:33.1809152Z INVOCATION_ID=92df7f3866bb4d08acaa1a9054d7e53b
2025-05-07T20:23:33.1809474Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:33.1809739Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:33.1810311Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_059f0104-fe17-4e08-a0e5-9395de160e8b
2025-05-07T20:23:33.1811069Z BUILD_ENV=build_binary
2025-05-07T20:23:33.1811298Z GITHUB_ACTOR=q10
2025-05-07T20:23:33.1811514Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:33.1811753Z KERN_NAME_LC=linux
2025-05-07T20:23:33.1811969Z BUILD_CUDA_VERSION=12.8.0
2025-05-07T20:23:33.1812266Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:33.1812610Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:33.1812855Z USER=ec2-user
2025-05-07T20:23:33.1813078Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:33.1813354Z SHLVL=1
2025-05-07T20:23:33.1813550Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:33.1813848Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:33.1814285Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:33.1814647Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:33.1814878Z KERN_NAME=Linux
2025-05-07T20:23:33.1815108Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:33.1815505Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:33.1815924Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:33.1816199Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:33.1816445Z JOURNAL_STREAM=8:82680
2025-05-07T20:23:33.1816747Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:33.1817162Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:33.1817466Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:33.1817787Z GITHUB_BASE_REF=main
2025-05-07T20:23:33.1817997Z CI=true
2025-05-07T20:23:33.1818233Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:33.1818632Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:33.1819018Z GITHUB_ACTION_REF=
2025-05-07T20:23:33.1819354Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:33.1820093Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_059f0104-fe17-4e08-a0e5-9395de160e8b
2025-05-07T20:23:33.1820661Z MACHINE_NAME=x86_64
2025-05-07T20:23:33.1820883Z _=/usr/bin/printenv
2025-05-07T20:23:33.1821019Z
2025-05-07T20:23:33.1821144Z ################################################################################
2025-05-07T20:23:33.1821569Z [INFO] Print ldd version ...
2025-05-07T20:23:33.1821923Z + ldd --version
2025-05-07T20:23:33.1822104Z
2025-05-07T20:23:33.1822235Z ldd (GNU libc) 2.34
2025-05-07T20:23:33.1822599Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:23:33.1823104Z This is free software; see the source for copying conditions.  There is NO
2025-05-07T20:23:33.1823628Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:23:33.1824067Z Written by Roland McGrath and Ulrich Drepper.
2025-05-07T20:23:33.1824281Z
2025-05-07T20:23:33.1824404Z ################################################################################
2025-05-07T20:23:33.1824707Z [INFO] Print CPU info ...
2025-05-07T20:23:33.1824945Z + nproc
2025-05-07T20:23:33.1825055Z
2025-05-07T20:23:33.1833781Z 16
2025-05-07T20:23:33.1835518Z
2025-05-07T20:23:33.1835733Z + lscpu
2025-05-07T20:23:33.1835915Z
2025-05-07T20:23:33.1905740Z Architecture:                         x86_64
2025-05-07T20:23:33.1906229Z CPU op-mode(s):                       32-bit, 64-bit
2025-05-07T20:23:33.1907228Z Address sizes:                        48 bits physical, 48 bits virtual
2025-05-07T20:23:33.1908076Z Byte Order:                           Little Endian
2025-05-07T20:23:33.1908431Z CPU(s):                               16
2025-05-07T20:23:33.1908730Z On-line CPU(s) list:                  0-15
2025-05-07T20:23:33.1909119Z Vendor ID:                            AuthenticAMD
2025-05-07T20:23:33.1909482Z Model name:                           AMD EPYC 7R32
2025-05-07T20:23:33.1909790Z CPU family:                           23
2025-05-07T20:23:33.1910301Z Model:                                49
2025-05-07T20:23:33.1910594Z Thread(s) per core:                   2
2025-05-07T20:23:33.1910875Z Core(s) per socket:                   8
2025-05-07T20:23:33.1911161Z Socket(s):                            1
2025-05-07T20:23:33.1911564Z Stepping:                             0
2025-05-07T20:23:33.1911851Z BogoMIPS:                             5599.99
2025-05-07T20:23:33.1913905Z Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:33.1915954Z Hypervisor vendor:                    KVM
2025-05-07T20:23:33.1916260Z Virtualization type:                  full
2025-05-07T20:23:33.1916590Z L1d cache:                            256 KiB (8 instances)
2025-05-07T20:23:33.1916951Z L1i cache:                            256 KiB (8 instances)
2025-05-07T20:23:33.1917314Z L2 cache:                             4 MiB (8 instances)
2025-05-07T20:23:33.1917663Z L3 cache:                             32 MiB (2 instances)
2025-05-07T20:23:33.1918009Z NUMA node(s):                         1
2025-05-07T20:23:33.1918316Z NUMA node0 CPU(s):                    0-15
2025-05-07T20:23:33.1918648Z Vulnerability Gather data sampling:   Not affected
2025-05-07T20:23:33.1919010Z Vulnerability Itlb multihit:          Not affected
2025-05-07T20:23:33.1919352Z Vulnerability L1tf:                   Not affected
2025-05-07T20:23:33.1919698Z Vulnerability Mds:                    Not affected
2025-05-07T20:23:33.1920048Z Vulnerability Meltdown:               Not affected
2025-05-07T20:23:33.1920390Z Vulnerability Mmio stale data:        Not affected
2025-05-07T20:23:33.1920777Z Vulnerability Reg file data sampling: Not affected
2025-05-07T20:23:33.1921312Z Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
2025-05-07T20:23:33.1922005Z Vulnerability Spec rstack overflow:   Mitigation; safe RET
2025-05-07T20:23:33.1922695Z Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
2025-05-07T20:23:33.1923435Z Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
2025-05-07T20:23:33.1924435Z Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
2025-05-07T20:23:33.1925122Z Vulnerability Srbds:                  Not affected
2025-05-07T20:23:33.1925480Z Vulnerability Tsx async abort:        Not affected
2025-05-07T20:23:33.1925797Z
2025-05-07T20:23:33.1925891Z + cat /proc/cpuinfo
2025-05-07T20:23:33.1926027Z
2025-05-07T20:23:33.1926213Z processor       : 0
2025-05-07T20:23:33.1926427Z vendor_id       : AuthenticAMD
2025-05-07T20:23:33.1926669Z cpu family      : 23
2025-05-07T20:23:33.1926885Z model           : 49
2025-05-07T20:23:33.1927089Z model name      : AMD EPYC 7R32
2025-05-07T20:23:33.1927339Z stepping        : 0
2025-05-07T20:23:33.1927553Z microcode       : 0x830107f
2025-05-07T20:23:33.1927806Z cpu MHz         : 3299.302
2025-05-07T20:23:33.1928050Z cache size      : 512 KB
2025-05-07T20:23:33.1928539Z physical id     : 0
2025-05-07T20:23:33.1928745Z siblings        : 16
2025-05-07T20:23:33.1928946Z core id         : 0
2025-05-07T20:23:33.1929142Z cpu cores       : 8
2025-05-07T20:23:33.1929340Z apicid          : 0
2025-05-07T20:23:33.1929530Z initial apicid  : 0
2025-05-07T20:23:33.1929739Z fpu             : yes
2025-05-07T20:23:33.1929940Z fpu_exception   : yes
2025-05-07T20:23:33.1930149Z cpuid level     : 13
2025-05-07T20:23:33.1930352Z wp              : yes
2025-05-07T20:23:33.1932474Z flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:33.1934827Z bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret
2025-05-07T20:23:33.1935305Z bogomips        : 5599.99
2025-05-07T20:23:33.1935524Z TLB size        : 3072 4K pages
2025-05-07T20:23:33.1935764Z clflush size    : 64
2025-05-07T20:23:33.1935978Z cache_alignment : 64
2025-05-07T20:23:33.1936246Z address sizes   : 48 bits physical, 48 bits virtual
2025-05-07T20:23:33.1936565Z power management:
2025-05-07T20:23:33.1936694Z
2025-05-07T20:23:33.1936783Z processor       : 1
2025-05-07T20:23:33.1936991Z vendor_id       : AuthenticAMD
2025-05-07T20:23:33.1937225Z cpu family      : 23
2025-05-07T20:23:33.1937428Z model           : 49
2025-05-07T20:23:33.1937627Z model name      : AMD EPYC 7R32
2025-05-07T20:23:33.1937871Z stepping        : 0
2025-05-07T20:23:33.1938078Z microcode       : 0x830107f
2025-05-07T20:23:33.1938295Z cpu MHz         : 3191.328
2025-05-07T20:23:33.1938506Z cache size      : 512 KB
2025-05-07T20:23:33.1938717Z physical id     : 0
2025-05-07T20:23:33.1938916Z siblings        : 16
2025-05-07T20:23:33.1939116Z core id         : 1
2025-05-07T20:23:33.1939312Z cpu cores       : 8
2025-05-07T20:23:33.1939505Z apicid          : 2
2025-05-07T20:23:33.1939701Z initial apicid  : 2
2025-05-07T20:23:33.1939913Z fpu             : yes
2025-05-07T20:23:33.1940107Z fpu_exception   : yes
2025-05-07T20:23:33.1940324Z cpuid level     : 13
2025-05-07T20:23:33.1940529Z wp              : yes
2025-05-07T20:23:33.1942449Z flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:33.1944634Z bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret
2025-05-07T20:23:33.1945123Z bogomips        : 5599.99
2025-05-07T20:23:33.1945345Z TLB size        : 3072 4K pages
2025-05-07T20:23:33.1945580Z clflush size    : 64
2025-05-07T20:23:33.1945793Z cache_alignment : 64 2025-05-07T20:23:33.1946072Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.1946390Z power management: 2025-05-07T20:23:33.1946523Z 2025-05-07T20:23:33.1946609Z processor : 2 2025-05-07T20:23:33.1946827Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.1947069Z cpu family : 23 2025-05-07T20:23:33.1947277Z model : 49 2025-05-07T20:23:33.1947487Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.1947731Z stepping : 0 2025-05-07T20:23:33.1947931Z microcode : 0x830107f 2025-05-07T20:23:33.1948164Z cpu MHz : 3299.688 2025-05-07T20:23:33.1948380Z cache size : 512 KB 2025-05-07T20:23:33.1948585Z physical id : 0 2025-05-07T20:23:33.1948791Z siblings : 16 2025-05-07T20:23:33.1948993Z core id : 2 2025-05-07T20:23:33.1949261Z cpu cores : 8 2025-05-07T20:23:33.1949460Z apicid : 4 2025-05-07T20:23:33.1949657Z initial apicid : 4 2025-05-07T20:23:33.1949860Z fpu : yes 2025-05-07T20:23:33.1950061Z fpu_exception : yes 2025-05-07T20:23:33.1950277Z cpuid level : 13 2025-05-07T20:23:33.1950475Z wp : yes 2025-05-07T20:23:33.1952478Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.1954721Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.1955205Z bogomips : 5599.99 2025-05-07T20:23:33.1955423Z TLB size : 3072 4K pages 2025-05-07T20:23:33.1955648Z clflush size : 64 2025-05-07T20:23:33.1955860Z cache_alignment : 64 2025-05-07T20:23:33.1956129Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.1956432Z power management: 2025-05-07T20:23:33.1956568Z 2025-05-07T20:23:33.1956649Z processor : 3 2025-05-07T20:23:33.1956864Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.1957101Z cpu family : 23 2025-05-07T20:23:33.1957302Z model : 49 2025-05-07T20:23:33.1957510Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.1957748Z stepping : 0 2025-05-07T20:23:33.1957985Z microcode : 0x830107f 2025-05-07T20:23:33.1958232Z cpu MHz : 3300.234 2025-05-07T20:23:33.1958450Z cache size : 512 KB 2025-05-07T20:23:33.1958665Z physical id : 0 2025-05-07T20:23:33.1958873Z siblings : 16 2025-05-07T20:23:33.1959070Z core id : 3 2025-05-07T20:23:33.1959269Z cpu cores : 8 2025-05-07T20:23:33.1959467Z apicid : 6 2025-05-07T20:23:33.1959664Z initial apicid : 6 2025-05-07T20:23:33.1959872Z fpu : yes 2025-05-07T20:23:33.1960074Z fpu_exception : yes 2025-05-07T20:23:33.1960297Z cpuid level : 13 2025-05-07T20:23:33.1960498Z wp : yes 2025-05-07T20:23:33.1962419Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.1964603Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.1965086Z bogomips : 5599.99 2025-05-07T20:23:33.1965309Z TLB size : 3072 4K pages 2025-05-07T20:23:33.1965536Z clflush size : 64 2025-05-07T20:23:33.1965754Z cache_alignment : 64 2025-05-07T20:23:33.1966024Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.1966328Z power management: 2025-05-07T20:23:33.2015278Z 2025-05-07T20:23:33.2015406Z processor : 4 2025-05-07T20:23:33.2015654Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2015947Z cpu family : 23 2025-05-07T20:23:33.2016186Z model : 49 2025-05-07T20:23:33.2016440Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2016680Z stepping : 0 2025-05-07T20:23:33.2016915Z microcode : 0x830107f 2025-05-07T20:23:33.2017170Z cpu MHz : 3288.514 2025-05-07T20:23:33.2017388Z cache size : 512 KB 2025-05-07T20:23:33.2017599Z physical id : 0 2025-05-07T20:23:33.2017816Z siblings : 16 2025-05-07T20:23:33.2018021Z core id : 4 2025-05-07T20:23:33.2018218Z cpu cores : 8 2025-05-07T20:23:33.2018420Z apicid : 8 2025-05-07T20:23:33.2018619Z initial apicid : 8 2025-05-07T20:23:33.2018827Z fpu : yes 2025-05-07T20:23:33.2019088Z fpu_exception : yes 2025-05-07T20:23:33.2019310Z cpuid level : 13 2025-05-07T20:23:33.2019522Z wp : yes 2025-05-07T20:23:33.2021614Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2023894Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2024377Z bogomips : 5599.99 2025-05-07T20:23:33.2024599Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2024825Z clflush size : 64 2025-05-07T20:23:33.2025043Z cache_alignment : 64 2025-05-07T20:23:33.2025310Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2025618Z power management: 2025-05-07T20:23:33.2025755Z 2025-05-07T20:23:33.2025840Z processor : 5 2025-05-07T20:23:33.2026056Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2026291Z cpu family : 23 2025-05-07T20:23:33.2026490Z model : 49 2025-05-07T20:23:33.2026698Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2026945Z stepping : 0 2025-05-07T20:23:33.2027146Z microcode : 0x830107f 2025-05-07T20:23:33.2027367Z cpu MHz : 3287.096 2025-05-07T20:23:33.2027588Z cache size : 512 KB 2025-05-07T20:23:33.2027795Z physical id : 0 2025-05-07T20:23:33.2027997Z siblings : 16 2025-05-07T20:23:33.2028467Z core id : 5 2025-05-07T20:23:33.2028720Z cpu cores : 8 2025-05-07T20:23:33.2028922Z apicid : 10 2025-05-07T20:23:33.2029168Z initial apicid : 10 2025-05-07T20:23:33.2029377Z fpu : yes 2025-05-07T20:23:33.2029580Z fpu_exception : yes 2025-05-07T20:23:33.2029796Z cpuid level : 13 2025-05-07T20:23:33.2030002Z wp : yes 2025-05-07T20:23:33.2031916Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2034092Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2034576Z bogomips : 5599.99 2025-05-07T20:23:33.2034798Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2035028Z clflush size : 64 2025-05-07T20:23:33.2035247Z cache_alignment : 64 2025-05-07T20:23:33.2035513Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2035822Z power management: 2025-05-07T20:23:33.2035960Z 2025-05-07T20:23:33.2036044Z processor : 6 2025-05-07T20:23:33.2036263Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2036497Z cpu family : 23 2025-05-07T20:23:33.2036706Z model : 49 2025-05-07T20:23:33.2036912Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2037146Z stepping : 0 2025-05-07T20:23:33.2037359Z microcode : 0x830107f 2025-05-07T20:23:33.2037591Z cpu MHz : 3314.091 2025-05-07T20:23:33.2037799Z cache size : 512 KB 2025-05-07T20:23:33.2038014Z physical id : 0 2025-05-07T20:23:33.2038220Z siblings : 16 2025-05-07T20:23:33.2038417Z core id : 6 2025-05-07T20:23:33.2038621Z cpu cores : 8 2025-05-07T20:23:33.2038826Z apicid : 12 2025-05-07T20:23:33.2039032Z initial apicid : 12 2025-05-07T20:23:33.2039245Z fpu : yes 2025-05-07T20:23:33.2039447Z fpu_exception : yes 2025-05-07T20:23:33.2039657Z cpuid level : 13 2025-05-07T20:23:33.2039867Z wp : yes 2025-05-07T20:23:33.2041978Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2044177Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2044666Z bogomips : 5599.99 2025-05-07T20:23:33.2045007Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2045241Z clflush size : 64 2025-05-07T20:23:33.2045456Z cache_alignment : 64 2025-05-07T20:23:33.2045715Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2046034Z power management: 2025-05-07T20:23:33.2046162Z 2025-05-07T20:23:33.2046254Z processor : 7 2025-05-07T20:23:33.2046464Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2046704Z cpu family : 23 2025-05-07T20:23:33.2046911Z model : 49 2025-05-07T20:23:33.2047111Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2047347Z stepping : 0 2025-05-07T20:23:33.2047559Z microcode : 0x830107f 2025-05-07T20:23:33.2047780Z cpu MHz : 3305.442 2025-05-07T20:23:33.2048001Z cache size : 512 KB 2025-05-07T20:23:33.2048220Z physical id : 0 2025-05-07T20:23:33.2048477Z siblings : 16 2025-05-07T20:23:33.2048755Z core id : 7 2025-05-07T20:23:33.2049016Z cpu cores : 8 2025-05-07T20:23:33.2049275Z apicid : 
14 2025-05-07T20:23:33.2049533Z initial apicid : 14 2025-05-07T20:23:33.2049763Z fpu : yes 2025-05-07T20:23:33.2049972Z fpu_exception : yes 2025-05-07T20:23:33.2050183Z cpuid level : 13 2025-05-07T20:23:33.2050391Z wp : yes 2025-05-07T20:23:33.2052308Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2054488Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2054962Z bogomips : 5599.99 2025-05-07T20:23:33.2055186Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2055422Z clflush size : 64 2025-05-07T20:23:33.2055634Z cache_alignment : 64 2025-05-07T20:23:33.2055909Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2056226Z power management: 2025-05-07T20:23:33.2056354Z 2025-05-07T20:23:33.2056447Z processor : 8 2025-05-07T20:23:33.2056658Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2056894Z cpu family : 23 2025-05-07T20:23:33.2057106Z model : 49 2025-05-07T20:23:33.2057305Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2057545Z stepping : 0 2025-05-07T20:23:33.2057757Z microcode : 0x830107f 2025-05-07T20:23:33.2057977Z cpu MHz : 3278.423 2025-05-07T20:23:33.2058213Z cache size : 512 KB 2025-05-07T20:23:33.2058455Z physical id : 0 2025-05-07T20:23:33.2058660Z siblings : 16 2025-05-07T20:23:33.2058863Z core id : 0 2025-05-07T20:23:33.2059059Z cpu cores : 8 2025-05-07T20:23:33.2059252Z apicid : 1 2025-05-07T20:23:33.2059446Z initial apicid : 1 2025-05-07T20:23:33.2059656Z fpu : yes 2025-05-07T20:23:33.2059848Z fpu_exception : yes 2025-05-07T20:23:33.2060068Z cpuid level : 13 2025-05-07T20:23:33.2060278Z wp : yes 2025-05-07T20:23:33.2062186Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2064715Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2065199Z bogomips : 5599.99 2025-05-07T20:23:33.2065418Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2065661Z clflush size : 64 2025-05-07T20:23:33.2065871Z cache_alignment : 64 2025-05-07T20:23:33.2066215Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2066531Z power management: 2025-05-07T20:23:33.2066660Z 2025-05-07T20:23:33.2066751Z processor : 9 2025-05-07T20:23:33.2066961Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2067197Z cpu family : 23 2025-05-07T20:23:33.2067404Z model : 49 2025-05-07T20:23:33.2067601Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2067840Z 
stepping : 0 2025-05-07T20:23:33.2068044Z microcode : 0x830107f 2025-05-07T20:23:33.2068261Z cpu MHz : 3293.563 2025-05-07T20:23:33.2068474Z cache size : 512 KB 2025-05-07T20:23:33.2068687Z physical id : 0 2025-05-07T20:23:33.2068889Z siblings : 16 2025-05-07T20:23:33.2069160Z core id : 1 2025-05-07T20:23:33.2069362Z cpu cores : 8 2025-05-07T20:23:33.2069552Z apicid : 3 2025-05-07T20:23:33.2069756Z initial apicid : 3 2025-05-07T20:23:33.2069965Z fpu : yes 2025-05-07T20:23:33.2070159Z fpu_exception : yes 2025-05-07T20:23:33.2070371Z cpuid level : 13 2025-05-07T20:23:33.2070574Z wp : yes 2025-05-07T20:23:33.2072476Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2074647Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2075125Z bogomips : 5599.99 2025-05-07T20:23:33.2075346Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2075581Z clflush size : 64 2025-05-07T20:23:33.2075790Z cache_alignment : 64 2025-05-07T20:23:33.2076052Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2076427Z power management: 2025-05-07T20:23:33.2076614Z 2025-05-07T20:23:33.2076709Z processor : 10 2025-05-07T20:23:33.2076990Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2077227Z cpu family : 23 2025-05-07T20:23:33.2077423Z model : 49 2025-05-07T20:23:33.2077625Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2077858Z stepping : 0 2025-05-07T20:23:33.2078057Z microcode : 0x830107f 2025-05-07T20:23:33.2078279Z cpu MHz : 3299.089 2025-05-07T20:23:33.2078492Z cache size : 512 KB 2025-05-07T20:23:33.2078699Z physical id : 0 2025-05-07T20:23:33.2078903Z siblings : 16 2025-05-07T20:23:33.2079102Z core id : 2 2025-05-07T20:23:33.2079291Z cpu cores : 8 2025-05-07T20:23:33.2079490Z apicid : 5 2025-05-07T20:23:33.2079689Z initial apicid : 5 2025-05-07T20:23:33.2079895Z fpu : yes 2025-05-07T20:23:33.2080090Z fpu_exception : yes 2025-05-07T20:23:33.2080306Z cpuid level : 13 2025-05-07T20:23:33.2080505Z wp : yes 2025-05-07T20:23:33.2082405Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2084580Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2085057Z bogomips : 5599.99 2025-05-07T20:23:33.2085379Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2085610Z clflush size : 64 2025-05-07T20:23:33.2085823Z cache_alignment : 64 2025-05-07T20:23:33.2086095Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:33.2086397Z power management: 2025-05-07T20:23:33.2086607Z 2025-05-07T20:23:33.2086691Z processor : 11 2025-05-07T20:23:33.2086915Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2087143Z cpu family : 23 2025-05-07T20:23:33.2087343Z model : 49 2025-05-07T20:23:33.2087550Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2087784Z stepping : 0 2025-05-07T20:23:33.2087996Z microcode : 0x830107f 2025-05-07T20:23:33.2088250Z cpu MHz : 3297.070 2025-05-07T20:23:33.2088480Z cache size : 512 KB 2025-05-07T20:23:33.2088701Z physical id : 0 2025-05-07T20:23:33.2088909Z siblings : 16 2025-05-07T20:23:33.2089108Z core id : 3 2025-05-07T20:23:33.2089310Z cpu cores : 8 2025-05-07T20:23:33.2089515Z apicid : 7 2025-05-07T20:23:33.2089707Z initial apicid : 7 2025-05-07T20:23:33.2089969Z fpu : yes 2025-05-07T20:23:33.2090246Z fpu_exception : yes 2025-05-07T20:23:33.2090511Z cpuid level : 13 2025-05-07T20:23:33.2090731Z wp : yes 2025-05-07T20:23:33.2092655Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2094857Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2095337Z bogomips : 5599.99 2025-05-07T20:23:33.2095549Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2095789Z clflush size : 64 2025-05-07T20:23:33.2096004Z cache_alignment : 64 2025-05-07T20:23:33.2096263Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2096584Z power management: 2025-05-07T20:23:33.2096713Z 2025-05-07T20:23:33.2096804Z processor : 12 2025-05-07T20:23:33.2097019Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2097256Z cpu family : 23 2025-05-07T20:23:33.2097464Z model : 49 2025-05-07T20:23:33.2097664Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2097905Z stepping : 0 2025-05-07T20:23:33.2098114Z microcode : 0x830107f 2025-05-07T20:23:33.2098332Z cpu MHz : 3288.719 2025-05-07T20:23:33.2098548Z cache size : 512 KB 2025-05-07T20:23:33.2098775Z physical id : 0 2025-05-07T20:23:33.2098976Z siblings : 16 2025-05-07T20:23:33.2099177Z core id : 4 2025-05-07T20:23:33.2099382Z cpu cores : 8 2025-05-07T20:23:33.2099580Z apicid : 9 2025-05-07T20:23:33.2099779Z initial apicid : 9 2025-05-07T20:23:33.2100002Z fpu : yes 2025-05-07T20:23:33.2100208Z fpu_exception : yes 2025-05-07T20:23:33.2100427Z cpuid level : 13 2025-05-07T20:23:33.2100644Z wp : yes 2025-05-07T20:23:33.2102570Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:33.2104990Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2105474Z bogomips : 5599.99 2025-05-07T20:23:33.2105703Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2105941Z clflush size : 64 2025-05-07T20:23:33.2106151Z cache_alignment : 64 2025-05-07T20:23:33.2106534Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2106853Z power management: 2025-05-07T20:23:33.2106983Z 2025-05-07T20:23:33.2107078Z processor : 13 2025-05-07T20:23:33.2107291Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2107529Z cpu family : 23 2025-05-07T20:23:33.2107816Z model : 49 2025-05-07T20:23:33.2108044Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2108308Z stepping : 0 2025-05-07T20:23:33.2108516Z microcode : 0x830107f 2025-05-07T20:23:33.2108733Z cpu MHz : 3291.493 2025-05-07T20:23:33.2108952Z cache size : 512 KB 2025-05-07T20:23:33.2109258Z physical id : 0 2025-05-07T20:23:33.2109461Z siblings : 16 2025-05-07T20:23:33.2109662Z core id : 5 2025-05-07T20:23:33.2109867Z cpu cores : 8 2025-05-07T20:23:33.2110061Z apicid : 11 2025-05-07T20:23:33.2110265Z initial apicid : 11 2025-05-07T20:23:33.2110483Z fpu : yes 2025-05-07T20:23:33.2110683Z fpu_exception : yes 2025-05-07T20:23:33.2110899Z cpuid level : 13 2025-05-07T20:23:33.2111104Z wp : yes 2025-05-07T20:23:33.2113025Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2115223Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2115708Z bogomips : 5599.99 2025-05-07T20:23:33.2115929Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2116172Z clflush size : 64 2025-05-07T20:23:33.2116384Z cache_alignment : 64 2025-05-07T20:23:33.2116653Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2116969Z power management: 2025-05-07T20:23:33.2117102Z 2025-05-07T20:23:33.2117186Z processor : 14 2025-05-07T20:23:33.2117404Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2117641Z cpu family : 23 2025-05-07T20:23:33.2117846Z model : 49 2025-05-07T20:23:33.2118086Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2118353Z stepping : 0 2025-05-07T20:23:33.2118560Z microcode : 0x830107f 2025-05-07T20:23:33.2118798Z cpu MHz : 3285.188 2025-05-07T20:23:33.2119012Z cache size : 512 KB 2025-05-07T20:23:33.2119222Z physical id : 0 2025-05-07T20:23:33.2119430Z siblings : 16 2025-05-07T20:23:33.2119630Z core id : 6 2025-05-07T20:23:33.2119824Z cpu cores : 8 2025-05-07T20:23:33.2120028Z apicid : 13 2025-05-07T20:23:33.2120238Z initial apicid : 13 2025-05-07T20:23:33.2120455Z fpu : yes 2025-05-07T20:23:33.2120661Z fpu_exception : yes 2025-05-07T20:23:33.2120881Z cpuid level : 13 2025-05-07T20:23:33.2121081Z wp : yes 2025-05-07T20:23:33.2123014Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2125208Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2125695Z bogomips : 5599.99 2025-05-07T20:23:33.2125920Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2126150Z clflush size : 64 2025-05-07T20:23:33.2126369Z cache_alignment : 64 2025-05-07T20:23:33.2126637Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2126944Z power management: 2025-05-07T20:23:33.2127083Z 2025-05-07T20:23:33.2127271Z processor : 15 2025-05-07T20:23:33.2127492Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2127723Z cpu family : 23 2025-05-07T20:23:33.2127930Z model : 49 2025-05-07T20:23:33.2128419Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2128669Z stepping : 0 2025-05-07T20:23:33.2129024Z microcode : 0x830107f 2025-05-07T20:23:33.2129250Z cpu MHz : 3281.467 2025-05-07T20:23:33.2129458Z cache size : 512 KB 2025-05-07T20:23:33.2129671Z physical id : 0 2025-05-07T20:23:33.2129877Z siblings : 16 2025-05-07T20:23:33.2130075Z core id : 7 2025-05-07T20:23:33.2130271Z cpu cores : 8 2025-05-07T20:23:33.2130470Z apicid : 15 2025-05-07T20:23:33.2130677Z initial apicid : 15 2025-05-07T20:23:33.2130882Z fpu : yes 2025-05-07T20:23:33.2131081Z fpu_exception : yes 2025-05-07T20:23:33.2131300Z cpuid level : 13 2025-05-07T20:23:33.2131499Z wp : yes 2025-05-07T20:23:33.2133423Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2135602Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2136089Z bogomips : 5599.99 2025-05-07T20:23:33.2136301Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2136537Z clflush size : 64 2025-05-07T20:23:33.2136750Z cache_alignment : 64 2025-05-07T20:23:33.2137009Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2137324Z power management: 2025-05-07T20:23:33.2137461Z 2025-05-07T20:23:33.2137465Z 2025-05-07T20:23:33.2137591Z ################################################################################ 2025-05-07T20:23:33.2137927Z [INFO] Print PCI info ... 2025-05-07T20:23:33.2138185Z + lspci -v 2025-05-07T20:23:33.2138310Z 2025-05-07T20:23:33.2138527Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:33.2138915Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:33.2139240Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:33.2139444Z 2025-05-07T20:23:33.2139637Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:33.2140023Z Physical Slot: 1 2025-05-07T20:23:33.2140270Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:33.2140471Z 2025-05-07T20:23:33.2140726Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:33.2141157Z Physical Slot: 1 2025-05-07T20:23:33.2141417Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:33.2141640Z 2025-05-07T20:23:33.2141917Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:33.2142354Z Physical Slot: 3 2025-05-07T20:23:33.2142603Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:33.2142944Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:33.2143303Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:33.2143522Z 2025-05-07T20:23:33.2143824Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:33.2144331Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:33.2144623Z Physical Slot: 4 2025-05-07T20:23:33.2144876Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:33.2145260Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:33.2145622Z Capabilities: 2025-05-07T20:23:33.2145899Z Kernel driver in use: nvme 2025-05-07T20:23:33.2146075Z 2025-05-07T20:23:33.2148266Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:33.2148761Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:33.2149176Z Physical Slot: 5 2025-05-07T20:23:33.2149413Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:33.2149770Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:33.2150234Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:33.2150551Z Capabilities: 2025-05-07T20:23:33.2150816Z Kernel driver in use: ena 2025-05-07T20:23:33.2151058Z Kernel modules: ena 2025-05-07T20:23:33.2151195Z 2025-05-07T20:23:33.2151364Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:33.2151742Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:33.2152034Z Physical Slot: 30 2025-05-07T20:23:33.2152296Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:33.2152665Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:33.2153060Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:33.2153429Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:33.2153749Z Capabilities: 2025-05-07T20:23:33.2154020Z Kernel driver in use: nvidia 2025-05-07T20:23:33.2154275Z Kernel modules: nvidia 2025-05-07T20:23:33.2154421Z 2025-05-07T20:23:33.2154721Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:33.2155230Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:33.2155521Z Physical Slot: 31 2025-05-07T20:23:33.2155766Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:33.2156114Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:33.2156492Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:33.2156816Z Capabilities: 2025-05-07T20:23:33.2157076Z Kernel driver in use: nvme 2025-05-07T20:23:33.2157246Z 2025-05-07T20:23:33.2157250Z 2025-05-07T20:23:33.2157367Z ################################################################################ 2025-05-07T20:23:33.2164221Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:33.2164544Z + uname -a 2025-05-07T20:23:33.2164668Z 2025-05-07T20:23:33.2165061Z Linux ip-10-0-66-0.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:33.2165555Z 2025-05-07T20:23:33.2165633Z + uname -m 2025-05-07T20:23:33.2165746Z 2025-05-07T20:23:33.2165829Z x86_64 2025-05-07T20:23:33.2165934Z 2025-05-07T20:23:33.2166018Z + cat /proc/version 2025-05-07T20:23:33.2166156Z 2025-05-07T20:23:33.2166689Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:33.2167311Z 2025-05-07T20:23:33.2167400Z + cat /etc/os-release 2025-05-07T20:23:33.2167541Z 2025-05-07T20:23:33.2167640Z NAME="Amazon Linux" 2025-05-07T20:23:33.2167847Z VERSION="2023" 2025-05-07T20:23:33.2168051Z ID="amzn" 2025-05-07T20:23:33.2168242Z ID_LIKE="fedora" 2025-05-07T20:23:33.2168443Z VERSION_ID="2023" 2025-05-07T20:23:33.2168679Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:33.2168960Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:33.2169246Z ANSI_COLOR="0;33" 2025-05-07T20:23:33.2169491Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:33.2169886Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:33.2170315Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:33.2170723Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:33.2171160Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:33.2171528Z VENDOR_NAME="AWS" 2025-05-07T20:23:33.2171763Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:33.2172054Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:33.2172203Z 2025-05-07T20:23:33.2172444Z ################################################################################ 2025-05-07T20:23:33.2172743Z # Print EC2 Instance Info 2025-05-07T20:23:33.2172977Z # 2025-05-07T20:23:33.2173184Z # [2025-05-07T20:23:33.211Z] + print_ec2_info 2025-05-07T20:23:33.2173495Z ################################################################################ 2025-05-07T20:23:33.2173783Z 2025-05-07T20:23:33.2239556Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:33.2374878Z instance-id: i-0e56304501e4f5200 2025-05-07T20:23:33.2490372Z instance-type: g5.4xlarge 2025-05-07T20:23:33.2531724Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:33.2532071Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:33.2541813Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:33.2542163Z env: 2025-05-07T20:23:33.2542385Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:33.2542691Z BUILD_ENV: build_binary 2025-05-07T20:23:33.2542934Z BUILD_TARGET: genai 2025-05-07T20:23:33.2543158Z BUILD_VARIANT: cuda 2025-05-07T20:23:33.2543395Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:33.2543658Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:33.2543965Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:33.2544292Z ##[endgroup] 2025-05-07T20:23:33.5876323Z ################################################################################ 2025-05-07T20:23:33.5876792Z [INFO] Printing general display info ... 2025-05-07T20:23:33.5893780Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:33.6815034Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:33.6824361Z /usr/bin/sudo 2025-05-07T20:23:33.6835077Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:33.6845078Z /usr/bin/yum 2025-05-07T20:23:33.6846796Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:33.6867213Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:34.1718219Z Last metadata expiration check: 0:00:05 ago on Wed May 7 20:23:29 2025. 2025-05-07T20:23:34.2489585Z ================================================================================ 2025-05-07T20:23:34.2490048Z WARNING: 2025-05-07T20:23:34.2490379Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:34.2490690Z 2025-05-07T20:23:34.2490848Z Available Versions: 2025-05-07T20:23:34.2491059Z 2025-05-07T20:23:34.2491184Z Version 2023.7.20250331: 2025-05-07T20:23:34.2491540Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:34.2491793Z 2025-05-07T20:23:34.2491941Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:34.2492148Z 2025-05-07T20:23:34.2492235Z Release notes: 2025-05-07T20:23:34.2492644Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:34.2493006Z 2025-05-07T20:23:34.2493104Z Version 2023.7.20250414: 2025-05-07T20:23:34.2493414Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:34.2493656Z 2025-05-07T20:23:34.2493771Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:34.2493985Z 2025-05-07T20:23:34.2494070Z Release notes: 2025-05-07T20:23:34.2494458Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:34.2494819Z 2025-05-07T20:23:34.2494914Z Version 2023.7.20250428: 2025-05-07T20:23:34.2495215Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:34.2495461Z 2025-05-07T20:23:34.2495575Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:34.2495782Z 2025-05-07T20:23:34.2495875Z Release notes: 2025-05-07T20:23:34.2496254Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:34.2496613Z 2025-05-07T20:23:34.2496735Z ================================================================================ 2025-05-07T20:23:34.3651653Z Dependencies resolved. 
2025-05-07T20:23:34.3938650Z ================================================================================ 2025-05-07T20:23:34.3939261Z Package Arch Version Repository Size 2025-05-07T20:23:34.3939767Z ================================================================================ 2025-05-07T20:23:34.3940117Z Upgrading: 2025-05-07T20:23:34.3940697Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:34.3941273Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:34.3941705Z 2025-05-07T20:23:34.3942003Z Transaction Summary 2025-05-07T20:23:34.3942264Z ================================================================================ 2025-05-07T20:23:34.3942571Z Upgrade 2 Packages 2025-05-07T20:23:34.3942772Z 2025-05-07T20:23:34.3942882Z Total download size: 6.9 M 2025-05-07T20:23:34.3943141Z Downloading Packages: 2025-05-07T20:23:34.4403823Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 27 MB/s | 1.2 MB 00:00 2025-05-07T20:23:34.4813341Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 66 MB/s | 5.7 MB 00:00 2025-05-07T20:23:34.4821572Z -------------------------------------------------------------------------------- 2025-05-07T20:23:34.4824547Z Total 79 MB/s | 6.9 MB 00:00 2025-05-07T20:23:34.4826940Z Running transaction check 2025-05-07T20:23:34.4924415Z Transaction check succeeded. 2025-05-07T20:23:34.4924964Z Running transaction test 2025-05-07T20:23:34.5221278Z Transaction test succeeded. 2025-05-07T20:23:34.5223798Z Running transaction 2025-05-07T20:23:35.0803341Z Preparing : 1/1 2025-05-07T20:23:35.1882793Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:35.1917654Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:35.2139535Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:35.2140176Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:35.2250279Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:35.2278976Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:35.3753266Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:35.3753844Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:35.3754378Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:35.3754914Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:35.5158431Z ================================================================================ 2025-05-07T20:23:35.5159074Z WARNING: 2025-05-07T20:23:35.5159431Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:35.5159738Z 2025-05-07T20:23:35.5159868Z Available Versions: 2025-05-07T20:23:35.5160065Z 2025-05-07T20:23:35.5160185Z Version 2023.7.20250331: 2025-05-07T20:23:35.5160561Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:35.5160809Z 2025-05-07T20:23:35.5160939Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:35.5161167Z 2025-05-07T20:23:35.5161251Z Release notes: 2025-05-07T20:23:35.5161658Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:35.5162022Z 2025-05-07T20:23:35.5162126Z Version 2023.7.20250414: 2025-05-07T20:23:35.5162426Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:35.5162671Z 2025-05-07T20:23:35.5162785Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:35.5162996Z 2025-05-07T20:23:35.5163093Z Release notes: 2025-05-07T20:23:35.5163479Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:35.5163834Z 2025-05-07T20:23:35.5163931Z Version 2023.7.20250428: 2025-05-07T20:23:35.5164225Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:35.5164472Z 2025-05-07T20:23:35.5164585Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:35.5164788Z 2025-05-07T20:23:35.5165152Z Release notes: 2025-05-07T20:23:35.5165533Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:35.5165890Z 2025-05-07T20:23:35.5166206Z ================================================================================ 2025-05-07T20:23:35.5723427Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:35.5724342Z 2025-05-07T20:23:35.5724582Z Upgraded: 2025-05-07T20:23:35.5725501Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:35.5726619Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:35.5727294Z 2025-05-07T20:23:35.5727455Z Complete! 2025-05-07T20:23:35.6182671Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:35.6208867Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:36.0751014Z Last metadata expiration check: 0:00:07 ago on Wed May 7 20:23:29 2025. 2025-05-07T20:23:36.0990760Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:36.0995718Z Package lshw-B.02.19.2-7.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:36.1397182Z Dependencies resolved. 2025-05-07T20:23:36.1579994Z Nothing to do. 2025-05-07T20:23:36.1580527Z Complete! 2025-05-07T20:23:36.1982437Z + hostname 2025-05-07T20:23:36.1982589Z 2025-05-07T20:23:36.1996778Z ip-10-0-66-0.ec2.internal 2025-05-07T20:23:36.1998162Z 2025-05-07T20:23:36.1998652Z + sudo lshw -C display 2025-05-07T20:23:36.1998877Z 2025-05-07T20:23:36.4794934Z *-display:0 UNCLAIMED 2025-05-07T20:23:36.4795381Z description: VGA compatible controller 2025-05-07T20:23:36.4795704Z product: Amazon.com, Inc. 2025-05-07T20:23:36.4795973Z vendor: Amazon.com, Inc. 
2025-05-07T20:23:36.4796231Z physical id: 3 2025-05-07T20:23:36.4796468Z bus info: pci@0000:00:03.0 2025-05-07T20:23:36.4796717Z version: 00 2025-05-07T20:23:36.4796956Z width: 32 bits 2025-05-07T20:23:36.4797173Z clock: 33MHz 2025-05-07T20:23:36.4797415Z capabilities: vga_controller bus_master 2025-05-07T20:23:36.4797727Z configuration: latency=0 2025-05-07T20:23:36.4798061Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:36.4798387Z *-display:1 2025-05-07T20:23:36.4798612Z description: 3D controller 2025-05-07T20:23:36.4798889Z product: GA102GL [A10G] 2025-05-07T20:23:36.4799179Z vendor: NVIDIA Corporation 2025-05-07T20:23:36.4799466Z physical id: 1e 2025-05-07T20:23:36.4799702Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:36.4799957Z version: a1 2025-05-07T20:23:36.4800163Z width: 64 bits 2025-05-07T20:23:36.4800382Z clock: 33MHz 2025-05-07T20:23:36.4800676Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:36.4801038Z configuration: driver=nvidia latency=0 2025-05-07T20:23:36.4801650Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:36.4834587Z 2025-05-07T20:23:36.4834790Z ################################################################################ 2025-05-07T20:23:36.4835265Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:36.4962107Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:36.5149266Z Wed May 7 20:23:36 2025 2025-05-07T20:23:36.5149765Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:36.5150291Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:36.5150765Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:36.5151252Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:36.5151772Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:36.5152409Z | | | MIG M. | 2025-05-07T20:23:36.5152904Z |=========================================+========================+======================| 2025-05-07T20:23:36.5283068Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:36.5283583Z | 0% 29C P8 22W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:36.5283986Z | | | N/A | 2025-05-07T20:23:36.5284374Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:36.5287934Z 2025-05-07T20:23:36.5288425Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:36.5288860Z | Processes: | 2025-05-07T20:23:36.5289282Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:36.5289689Z | ID ID Usage | 2025-05-07T20:23:36.5290031Z |=========================================================================================| 2025-05-07T20:23:36.5293253Z | No running processes found | 2025-05-07T20:23:36.5293720Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:36.7749409Z ################################################################################ 2025-05-07T20:23:36.7749745Z [INFO] Printing AMD GPU info ... 
2025-05-07T20:23:36.7894020Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:36.7894835Z [CHECK] rocminfo not found 2025-05-07T20:23:36.7903722Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:36.7904855Z [CHECK] rocm-smi not found 2025-05-07T20:23:36.7948737Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:36.7949574Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:36.7963947Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:36.7964516Z env: 2025-05-07T20:23:36.7964872Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:36.7965360Z BUILD_ENV: build_binary 2025-05-07T20:23:36.7965741Z BUILD_TARGET: genai 2025-05-07T20:23:36.7966084Z BUILD_VARIANT: cuda 2025-05-07T20:23:36.7966456Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:36.7966859Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:36.7967319Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:36.7967822Z ##[endgroup] 2025-05-07T20:23:37.1333317Z ################################################################################ 2025-05-07T20:23:37.1333723Z # Setup Miniconda 2025-05-07T20:23:37.1333940Z # 2025-05-07T20:23:37.1349706Z # [2025-05-07T20:23:37.134Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:37.1350114Z ################################################################################ 2025-05-07T20:23:37.1350327Z 2025-05-07T20:23:37.1366161Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:37.2248902Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:37.2249421Z [SETUP] A Miniconda installation appears to already exist in /home/ec2-user/miniconda ... 2025-05-07T20:23:37.2249973Z [SETUP] Clearing out directory: /home/ec2-user/miniconda ... 2025-05-07T20:23:37.2250341Z + rm -rf /home/ec2-user/miniconda 2025-05-07T20:23:37.2250532Z 2025-05-07T20:23:42.1495953Z 2025-05-07T20:23:42.1496647Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:42.1497101Z 2025-05-07T20:23:42.1513992Z 2025-05-07T20:23:42.1514443Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:42.1538075Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:43.1491822Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:43.1492212Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:43.1492466Z 2025-05-07T20:23:43.1639466Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:43.6175587Z Unpacking payload ... 2025-05-07T20:23:44.1390754Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:44.9454472Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:47.0588735Z 2025-05-07T20:23:47.0589477Z Installing base environment... 2025-05-07T20:23:47.0589808Z 2025-05-07T20:23:48.1420674Z Preparing transaction: ...working... done 2025-05-07T20:23:51.1484705Z Executing transaction: ...working... done 2025-05-07T20:23:51.8076875Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 
2025-05-07T20:23:51.8973224Z installation finished. 2025-05-07T20:23:51.8980143Z 2025-05-07T20:23:51.8980362Z + rm -f miniconda.sh 2025-05-07T20:23:51.8980512Z 2025-05-07T20:23:51.9303921Z 2025-05-07T20:23:51.9304114Z [SETUP] Reloading the bash configuration ... 2025-05-07T20:23:51.9304462Z + /home/ec2-user/miniconda/bin/conda init bash 2025-05-07T20:23:51.9304684Z 2025-05-07T20:23:52.2964228Z no change /home/ec2-user/miniconda/condabin/conda 2025-05-07T20:23:52.2964649Z no change /home/ec2-user/miniconda/bin/conda 2025-05-07T20:23:52.2965099Z no change /home/ec2-user/miniconda/bin/conda-env 2025-05-07T20:23:52.2965546Z no change /home/ec2-user/miniconda/bin/activate 2025-05-07T20:23:52.2966007Z no change /home/ec2-user/miniconda/bin/deactivate 2025-05-07T20:23:52.2966874Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh 2025-05-07T20:23:52.2967370Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish 2025-05-07T20:23:52.2967801Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1 2025-05-07T20:23:52.2968243Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1 2025-05-07T20:23:52.2968764Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh 2025-05-07T20:23:52.2969273Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh 2025-05-07T20:23:52.2969624Z no change /home/ec2-user/.bashrc 2025-05-07T20:23:52.2969896Z No action taken. 2025-05-07T20:23:52.3624290Z 2025-05-07T20:23:52.3624865Z + . /home/ec2-user/.bashrc 2025-05-07T20:23:52.3625121Z 2025-05-07T20:23:53.2032722Z 2025-05-07T20:23:53.2033266Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ... 2025-05-07T20:23:53.2056163Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive 2025-05-07T20:24:06.5918101Z Collecting package metadata (current_repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - done 2025-05-07T20:24:08.1736940Z Solving environment: | / - \ | / - \ | / - \ done 2025-05-07T20:24:08.2708058Z 2025-05-07T20:24:08.2708397Z ## Package Plan ## 2025-05-07T20:24:08.2708551Z 2025-05-07T20:24:08.2708712Z environment location: /home/ec2-user/miniconda 2025-05-07T20:24:08.2709289Z 2025-05-07T20:24:08.2709389Z added / updated specs: 2025-05-07T20:24:08.2709660Z - conda-libmamba-solver 2025-05-07T20:24:08.2709923Z - libarchive 2025-05-07T20:24:08.2710140Z - libmamba 2025-05-07T20:24:08.2710355Z - libmambapy 2025-05-07T20:24:08.2710481Z 2025-05-07T20:24:08.2710485Z 2025-05-07T20:24:08.2710613Z The following packages will be downloaded: 2025-05-07T20:24:08.2710835Z 2025-05-07T20:24:08.2710955Z package | build 2025-05-07T20:24:08.2711273Z ---------------------------|----------------- 2025-05-07T20:24:08.2711686Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge 2025-05-07T20:24:08.2712177Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge 2025-05-07T20:24:08.2712603Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge 2025-05-07T20:24:08.2713073Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge 2025-05-07T20:24:08.2713531Z ------------------------------------------------------------ 2025-05-07T20:24:08.2713876Z Total: 1.4 MB 2025-05-07T20:24:08.2714089Z 2025-05-07T20:24:08.2714207Z The following packages will be UPDATED: 
2025-05-07T20:24:08.2717944Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:08.2718719Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:08.2719312Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:08.2719934Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:08.2720722Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:08.2721580Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:08.3304865Z conda-libmamba-solve | 41 KB | ########## | 100%
2025-05-07T20:24:08.3410111Z ca-certificates-2025 | 149 KB | ########## | 100%
2025-05-07T20:24:08.3731218Z certifi-2025.4.26 | 154 KB | ########## | 100%
2025-05-07T20:24:08.4725309Z conda-25.3.1 | 1.1 MB | ########## | 100%
2025-05-07T20:24:08.4727474Z done
2025-05-07T20:24:08.5729227Z Preparing transaction: done
2025-05-07T20:24:08.6736959Z Verifying transaction: done
2025-05-07T20:24:10.0762183Z Executing transaction: done
2025-05-07T20:24:11.9958264Z [SETUP] Updating Miniconda base packages ...
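Note: the Miniconda bootstrap earlier in this step ran the installer in batch mode so no prompts block the job. A condensed sketch of the non-interactive pattern (installer URL and flags as seen in the log):

    # Unattended Miniconda install: -b = batch mode (no prompts),
    # -p = target prefix, -u = update an existing install in place.
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$HOME/miniconda" -u
    rm -f miniconda.sh
    "$HOME/miniconda/bin/conda" init bash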
2025-05-07T20:24:11.9986469Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:12.8264824Z Channels:
2025-05-07T20:24:12.8265092Z - defaults
2025-05-07T20:24:12.8265345Z Platform: linux-64
2025-05-07T20:24:14.0535264Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:14.1757511Z Solving environment: done
2025-05-07T20:24:14.4596191Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:14.6701934Z Solving environment: done
2025-05-07T20:24:14.8215261Z ## Package Plan ##
2025-05-07T20:24:14.8215595Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:14.8215969Z added / updated specs:
2025-05-07T20:24:14.8216242Z - conda
2025-05-07T20:24:14.8216506Z The following packages will be downloaded:
2025-05-07T20:24:14.8216865Z package | build
2025-05-07T20:24:14.8217208Z ---------------------------|-----------------
2025-05-07T20:24:14.8217848Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:24:14.8218233Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:24:14.8218596Z ------------------------------------------------------------
2025-05-07T20:24:14.8218930Z Total: 1.4 MB
2025-05-07T20:24:14.8219258Z The following packages will be UPDATED:
2025-05-07T20:24:14.8219765Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:14.8220259Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:14.8220661Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:14.8767130Z tzdata-2025b | 116 KB | ########## | 100%
2025-05-07T20:24:15.0882256Z pip-25.1 | 1.3 MB | ########## | 100%
2025-05-07T20:24:15.0889466Z done
2025-05-07T20:24:15.1892653Z Preparing transaction: done
2025-05-07T20:24:15.2898811Z Verifying transaction: done
2025-05-07T20:24:17.2926635Z Executing transaction: done
2025-05-07T20:24:17.9006963Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:17.9010644Z + conda clean --packages --tarball -y
2025-05-07T20:24:19.1374519Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:19.1374897Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:19.2051269Z + conda clean --all -y
2025-05-07T20:24:19.7503421Z There are no unused tarball(s) to remove.
2025-05-07T20:24:19.7503806Z Will remove 1 index cache(s).
2025-05-07T20:24:19.7504093Z There are no unused package(s) to remove.
2025-05-07T20:24:19.7504402Z There are no tempfile(s) to remove.
2025-05-07T20:24:19.7504739Z There are no logfile(s) to remove.
2025-05-07T20:24:19.8148806Z + conda info
2025-05-07T20:24:20.5664550Z active environment : base
2025-05-07T20:24:20.5664939Z active env location : /home/ec2-user/miniconda
2025-05-07T20:24:20.5665294Z shell level : 1
2025-05-07T20:24:20.5665614Z user config file : /home/ec2-user/.condarc
2025-05-07T20:24:20.5665990Z populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:24:20.5666351Z conda version : 25.3.1
2025-05-07T20:24:20.5666631Z conda-build version : not installed
2025-05-07T20:24:20.5666924Z python version : 3.13.2.final.0
2025-05-07T20:24:20.5667223Z solver : libmamba (default)
2025-05-07T20:24:20.5667531Z virtual packages : __archspec=1=zen2
2025-05-07T20:24:20.5667821Z __conda=25.3.1=0
2025-05-07T20:24:20.5668104Z __cuda=12.8=0
2025-05-07T20:24:20.5668377Z __glibc=2.34=0
2025-05-07T20:24:20.5668670Z __linux=6.1.130=0
2025-05-07T20:24:20.5668939Z __unix=0=0
2025-05-07T20:24:20.5669559Z base environment : /home/ec2-user/miniconda (writable)
2025-05-07T20:24:20.5669972Z conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:24:20.5670309Z conda av metadata url : None
2025-05-07T20:24:20.5670679Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:24:20.5671100Z https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:24:20.5671473Z https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:24:20.5671849Z https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:24:20.5672218Z package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:24:20.5672558Z /home/ec2-user/.conda/pkgs
2025-05-07T20:24:20.5672888Z envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:24:20.5673230Z /home/ec2-user/.conda/envs
2025-05-07T20:24:20.5673534Z platform : linux-64
2025-05-07T20:24:20.5674350Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:24:20.5675164Z UID:GID : 1000:1000
2025-05-07T20:24:20.5675440Z netrc file : None
2025-05-07T20:24:20.5675702Z offline mode : False
2025-05-07T20:24:20.6341432Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:24:20.6342832Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_7aad9153-10f6-47cc-a7e3-d15333c993e3 ...
2025-05-07T20:24:20.6343601Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
2025-05-07T20:24:20.6425434Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.11
2025-05-07T20:24:20.6426091Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.11
2025-05-07T20:24:20.6444600Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:20.6444955Z env:
2025-05-07T20:24:20.6445183Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:20.6445471Z BUILD_ENV: build_binary
2025-05-07T20:24:20.6445716Z BUILD_TARGET: genai
2025-05-07T20:24:20.6445941Z BUILD_VARIANT: cuda
2025-05-07T20:24:20.6446165Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:24:20.6446422Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:20.6446719Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:20.6447040Z ##[endgroup]
2025-05-07T20:24:20.9809273Z ################################################################################
2025-05-07T20:24:20.9809750Z # Create Conda Environment
2025-05-07T20:24:20.9810082Z #
2025-05-07T20:24:20.9826593Z # [2025-05-07T20:24:20.982Z] + create_conda_environment build_binary 3.11
2025-05-07T20:24:20.9827170Z ################################################################################
2025-05-07T20:24:20.9843975Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:21.0710095Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:21.0710619Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:24:21.0711029Z + conda info --envs
2025-05-07T20:24:21.8197414Z # conda environments:
2025-05-07T20:24:21.8197794Z #
2025-05-07T20:24:21.8198114Z base /home/ec2-user/miniconda
2025-05-07T20:24:21.8857037Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:24:23.5281865Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:23.5303642Z [SETUP] Creating new Conda environment (Python 3.11) ...
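Note: the environment listing and `rm -rf` above ensure repeated runs start from a clean slate before the `conda create` below. A condensed sketch of that idempotent pattern (using `conda info --base` in place of the hard-coded prefix is an assumption):

    # Recreate the build environment from scratch on every run.
    ENV_NAME=build_binary
    rm -rf "$(conda info --base)/envs/${ENV_NAME}"
    conda create -y -n "${ENV_NAME}" python=3.11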
2025-05-07T20:24:23.5326296Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.11 2025-05-07T20:24:24.2878200Z Channels: 2025-05-07T20:24:24.2878519Z - defaults 2025-05-07T20:24:24.2878799Z Platform: linux-64 2025-05-07T20:24:25.8394558Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ done 2025-05-07T20:24:25.9636956Z Solving environment: / done 2025-05-07T20:24:25.9924162Z 2025-05-07T20:24:25.9924552Z ## Package Plan ## 2025-05-07T20:24:25.9924766Z 2025-05-07T20:24:25.9925044Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:25.9925454Z 2025-05-07T20:24:25.9925577Z added / updated specs: 2025-05-07T20:24:25.9925821Z - python=3.11 2025-05-07T20:24:25.9925958Z 2025-05-07T20:24:25.9925963Z 2025-05-07T20:24:25.9926083Z The following packages will be downloaded: 2025-05-07T20:24:25.9926294Z 2025-05-07T20:24:25.9926445Z package | build 2025-05-07T20:24:25.9926756Z ---------------------------|----------------- 2025-05-07T20:24:25.9927119Z _libgcc_mutex-0.1 | main 3 KB 2025-05-07T20:24:25.9927547Z _openmp_mutex-5.1 | 1_gnu 21 KB 2025-05-07T20:24:25.9928078Z ca-certificates-2025.2.25 | h06a4308_0 129 KB 2025-05-07T20:24:25.9928836Z python-3.11.11 | he870216_0 32.9 MB 2025-05-07T20:24:25.9936927Z setuptools-78.1.1 | py311h06a4308_0 2.3 MB 2025-05-07T20:24:25.9937511Z wheel-0.45.1 | py311h06a4308_0 151 KB 2025-05-07T20:24:25.9938013Z ------------------------------------------------------------ 2025-05-07T20:24:25.9938372Z Total: 35.4 MB 2025-05-07T20:24:25.9938575Z 2025-05-07T20:24:25.9938714Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:25.9938933Z 2025-05-07T20:24:25.9939515Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main 2025-05-07T20:24:25.9940097Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 2025-05-07T20:24:25.9940509Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6 2025-05-07T20:24:25.9940986Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0 2025-05-07T20:24:25.9941509Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0 2025-05-07T20:24:25.9941984Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1 2025-05-07T20:24:25.9942407Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 2025-05-07T20:24:25.9942837Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 2025-05-07T20:24:25.9943338Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 2025-05-07T20:24:25.9943792Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0 2025-05-07T20:24:25.9944211Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0 2025-05-07T20:24:25.9944626Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0 2025-05-07T20:24:25.9945023Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2 2025-05-07T20:24:25.9945414Z python pkgs/main/linux-64::python-3.11.11-he870216_0 2025-05-07T20:24:25.9945838Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0 2025-05-07T20:24:25.9946304Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py311h06a4308_0 2025-05-07T20:24:25.9946754Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0 2025-05-07T20:24:25.9947160Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0 2025-05-07T20:24:25.9947561Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0 2025-05-07T20:24:25.9947971Z wheel pkgs/main/linux-64::wheel-0.45.1-py311h06a4308_0 2025-05-07T20:24:25.9948350Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1 2025-05-07T20:24:25.9948727Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1 2025-05-07T20:24:25.9948965Z 2025-05-07T20:24:25.9948970Z 
2025-05-07T20:24:25.9949200Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:26.0378120Z _openmp_mutex-5.1 | 21 KB | ########## | 100%
2025-05-07T20:24:26.0667540Z _libgcc_mutex-0.1 | 3 KB | ########## | 100%
2025-05-07T20:24:26.1341414Z ca-certificates-2025 | 129 KB | ########## | 100%
2025-05-07T20:24:26.1928331Z wheel-0.45.1 | 151 KB | ########## | 100%
2025-05-07T20:24:26.4562794Z python-3.11.11 | 32.9 MB | ########## | 100%
2025-05-07T20:24:27.0769147Z setuptools-78.1.1 | 2.3 MB | ########## | 100%
2025-05-07T20:24:27.0779497Z done
2025-05-07T20:24:27.2885915Z Preparing transaction: done
2025-05-07T20:24:28.6579323Z Verifying transaction: done
2025-05-07T20:24:30.9707068Z Executing transaction: done
2025-05-07T20:24:31.0237011Z #
2025-05-07T20:24:31.0237368Z # To activate this environment, use
2025-05-07T20:24:31.0237766Z #
2025-05-07T20:24:31.0238043Z #     $ conda activate build_binary
2025-05-07T20:24:31.0238365Z #
2025-05-07T20:24:31.0238585Z # To deactivate an active environment, use
2025-05-07T20:24:31.0238881Z #
2025-05-07T20:24:31.0239078Z #     $ conda deactivate
2025-05-07T20:24:31.1295132Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:31.1319339Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:33.8748566Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (25.1)
2025-05-07T20:24:33.8749359Z Collecting pip
2025-05-07T20:24:33.8749684Z Using cached pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:33.8750097Z Using cached pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:33.8750440Z Installing collected packages: pip
2025-05-07T20:24:33.8750743Z Attempting uninstall: pip
2025-05-07T20:24:33.8751033Z Found existing installation: pip 25.1
2025-05-07T20:24:33.8751338Z Uninstalling pip-25.1:
2025-05-07T20:24:33.8751618Z Successfully uninstalled pip-25.1
2025-05-07T20:24:33.8751934Z Successfully installed pip-25.1.1
2025-05-07T20:24:33.9380327Z [SETUP] Upgrading pyOpenSSL ...
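Note: the version spec passed to the next command contains `>`. The retry wrapper hands it to conda as a single argument; when typing the equivalent command into a shell, the spec must be quoted so `>` is not parsed as output redirection. A sketch:

    # Quote the spec so '>' is not treated as shell redirection.
    conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"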
2025-05-07T20:24:33.9403036Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:34.7945665Z Channels:
2025-05-07T20:24:34.7946022Z - conda-forge
2025-05-07T20:24:34.7946345Z Platform: linux-64
2025-05-07T20:24:45.3690724Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:47.0903237Z Solving environment: done
2025-05-07T20:24:47.1521621Z ## Package Plan ##
2025-05-07T20:24:47.1522162Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:47.1522670Z added / updated specs:
2025-05-07T20:24:47.1522945Z - pyopenssl[version='>22.1.0']
2025-05-07T20:24:47.1523261Z The following packages will be downloaded:
2025-05-07T20:24:47.1523599Z package | build
2025-05-07T20:24:47.1523928Z ---------------------------|-----------------
2025-05-07T20:24:47.1524321Z cffi-1.17.1 | py311hf29c0ef_0 295 KB conda-forge
2025-05-07T20:24:47.1524770Z cryptography-44.0.3 | py311hafd3f86_0 1.5 MB conda-forge
2025-05-07T20:24:47.1525324Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:24:47.1525739Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:24:47.1526147Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:24:47.1526565Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:24:47.1526981Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:24:47.1527413Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:24:47.1527833Z python_abi-3.11 | 2_cp311 5 KB conda-forge
2025-05-07T20:24:47.1528522Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:24:47.1529014Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:24:47.1529440Z ------------------------------------------------------------
2025-05-07T20:24:47.1529794Z Total: 6.4 MB
2025-05-07T20:24:47.1530220Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:47.1530675Z cffi conda-forge/linux-64::cffi-1.17.1-py311hf29c0ef_0
2025-05-07T20:24:47.1531171Z cryptography conda-forge/linux-64::cryptography-44.0.3-py311hafd3f86_0
2025-05-07T20:24:47.1531660Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:47.1532102Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:47.1532575Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:47.1533030Z python_abi conda-forge/linux-64::python_abi-3.11-2_cp311
2025-05-07T20:24:47.1534076Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:47.1535228Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:47.1535681Z The following packages will be UPDATED:
2025-05-07T20:24:47.1536277Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:47.1537021Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:47.1537659Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:47.1538283Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:47.1538802Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:47.3747660Z cffi-1.17.1 | 295 KB | ########## | 100%
2025-05-07T20:24:47.3894472Z libgomp-15.1.0 | 442 KB | ########## | 100%
2025-05-07T20:24:47.4039949Z libgcc-15.1.0 | 810 KB | ########## | 100%
2025-05-07T20:24:47.4070801Z pyopenssl-25.0.0 | 120 KB | ########## | 100%
2025-05-07T20:24:47.4075515Z typing-extensions-4. | 88 KB | ########## | 100%
2025-05-07T20:24:47.4364410Z typing_extensions-4. | 51 KB | ########## | 100%
2025-05-07T20:24:47.4491524Z python_abi-3.11 | 5 KB | ########## | 100%
2025-05-07T20:24:47.4993015Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:24:47.5898101Z pycparser-2.22 | 108 KB | ########## | 100%
2025-05-07T20:24:47.6043837Z openssl-3.5.0 | 3.0 MB | ########## | 100%
2025-05-07T20:24:47.6051990Z cryptography-44.0.3 | 1.5 MB | ########## | 100%
2025-05-07T20:24:47.6057395Z done
2025-05-07T20:24:47.7066738Z Preparing transaction: done
2025-05-07T20:24:47.8071609Z Verifying transaction: done
2025-05-07T20:24:49.3098805Z Executing transaction: done
2025-05-07T20:24:49.4862905Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:51.2209605Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:51.2224261Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:51.2248616Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:52.0909717Z Channels:
2025-05-07T20:24:52.0909962Z - conda-forge
2025-05-07T20:24:52.0910196Z Platform: linux-64
2025-05-07T20:24:55.3826716Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:55.7503030Z Solving environment: done
2025-05-07T20:24:55.8131741Z ## Package Plan ##
2025-05-07T20:24:55.8132746Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:55.8133769Z added / updated specs:
2025-05-07T20:24:55.8134313Z - libxcrypt
2025-05-07T20:24:55.8134846Z The following packages will be downloaded:
2025-05-07T20:24:55.8135495Z package | build
2025-05-07T20:24:55.8136126Z ---------------------------|-----------------
2025-05-07T20:24:55.8136867Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:24:55.8137663Z ------------------------------------------------------------
2025-05-07T20:24:55.8138148Z Total: 98 KB
2025-05-07T20:24:55.8138483Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:55.8138925Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:55.8139363Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:55.9899940Z libxcrypt-4.4.36 | 98 KB | ########## | 100%
2025-05-07T20:24:55.9903124Z done
2025-05-07T20:24:56.0905932Z Preparing transaction: done
2025-05-07T20:24:56.1910463Z Verifying transaction: done
2025-05-07T20:24:56.2914894Z Executing transaction: done
2025-05-07T20:24:59.7360710Z [SETUP] Copying over ...
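Note: libxcrypt is pulled in because newer distributions no longer ship crypt.h with the base toolchain, while some Python 3.11 extension builds still reference it; the copy that follows drops the conda-forge header into the environment's Python include directory. A sketch of the workaround (paths as in the command below):

    # Make crypt.h visible to Python extension builds in this environment.
    PREFIX=/home/ec2-user/miniconda/envs/build_binary
    cp "${PREFIX}/include/crypt.h" "${PREFIX}/include/python3.11/crypt.h"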
2025-05-07T20:24:59.7361416Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.11/crypt.h 2025-05-07T20:24:59.7361956Z 2025-05-07T20:24:59.7390907Z 2025-05-07T20:25:01.3857762Z [SETUP] Installed Python version: Python 3.11.11 2025-05-07T20:25:01.3858429Z [SETUP] Successfully created Conda environment: build_binary 2025-05-07T20:25:01.3892116Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:25:01.3892568Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:25:01.3904560Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:01.3904895Z env: 2025-05-07T20:25:01.3905126Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:01.3905431Z BUILD_ENV: build_binary 2025-05-07T20:25:01.3905672Z BUILD_TARGET: genai 2025-05-07T20:25:01.3905891Z BUILD_VARIANT: cuda 2025-05-07T20:25:01.3906123Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:25:01.3906372Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:01.3906662Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:01.3906989Z ##[endgroup] 2025-05-07T20:25:01.7366976Z ################################################################################ 2025-05-07T20:25:01.7367357Z # Install C/C++ Compilers 2025-05-07T20:25:01.7367597Z # 2025-05-07T20:25:01.7383879Z # [2025-05-07T20:25:01.738Z] + install_cxx_compiler build_binary gcc 2025-05-07T20:25:01.7384587Z ################################################################################ 2025-05-07T20:25:01.7384805Z 2025-05-07T20:25:01.7404103Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:01.8322874Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:01.8334137Z [INSTALL] Installing GLIBC (architecture = 64) ... 
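Note: pinning sysroot_linux-64=2.17 caps the glibc version the toolchain links against, keeping the resulting binaries loadable on older distributions (glibc 2.17 is the manylinux2014 baseline). The install below is the single command involved; a sketch with a hypothetical post-build check added as a comment:

    # Pin the build sysroot to glibc 2.17 for broad runtime compatibility.
    conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
    # Hypothetical check on a built artifact: the highest GLIBC_* symbol
    # version referenced should not exceed 2.17.
    # objdump -T some_built_lib.so | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -1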
2025-05-07T20:25:01.8358026Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:02.7010815Z Channels:
2025-05-07T20:25:02.7011456Z - conda-forge
2025-05-07T20:25:02.7012068Z Platform: linux-64
2025-05-07T20:25:06.0644880Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:06.4346228Z Solving environment: done
2025-05-07T20:25:06.4967939Z ## Package Plan ##
2025-05-07T20:25:06.4968312Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:06.4968730Z added / updated specs:
2025-05-07T20:25:06.4969005Z - sysroot_linux-64=2.17
2025-05-07T20:25:06.4969321Z The following packages will be downloaded:
2025-05-07T20:25:06.4969659Z package | build
2025-05-07T20:25:06.4969981Z ---------------------------|-----------------
2025-05-07T20:25:06.4970392Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge
2025-05-07T20:25:06.4970879Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge
2025-05-07T20:25:06.4971299Z ------------------------------------------------------------
2025-05-07T20:25:06.4971649Z Total: 15.4 MB
2025-05-07T20:25:06.4972035Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:06.4972552Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:06.4973120Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:25:06.4973584Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:06.8238944Z kernel-headers_linux | 921 KB | ########## | 100%
2025-05-07T20:25:07.0186635Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:25:07.4459349Z done
2025-05-07T20:25:07.5462746Z Preparing transaction: done
2025-05-07T20:25:07.7466951Z Verifying transaction: done
2025-05-07T20:25:07.9522992Z Executing transaction: done
2025-05-07T20:25:08.1087897Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:08.1088199Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:09.7937821Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:25:09.7951823Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
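Note: the conda-forge compiler packages (gcc_linux-64 / gxx_linux-64) ship activation scripts that export CC, CXX, and related flags inside the environment; this is an assumption about those packages' behavior, not something this log prints. A quick hypothetical check once the install finishes:

    # Activation scripts from the compiler packages export CC/CXX.
    conda run -n build_binary bash -c 'echo "CC=${CC} CXX=${CXX}"; "${CXX}" --version | head -1'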
2025-05-07T20:25:09.7973193Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:10.6862006Z Channels:
2025-05-07T20:25:10.6862266Z - conda-forge
2025-05-07T20:25:10.6862500Z Platform: linux-64
2025-05-07T20:25:14.0093710Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:14.9637028Z Solving environment: done
2025-05-07T20:25:15.0291140Z ## Package Plan ##
2025-05-07T20:25:15.0291939Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:15.0292720Z added / updated specs:
2025-05-07T20:25:15.0293233Z - gxx_linux-64=11.4.0
2025-05-07T20:25:15.0293810Z The following packages will be downloaded:
2025-05-07T20:25:15.0294529Z package | build
2025-05-07T20:25:15.0295067Z ---------------------------|-----------------
2025-05-07T20:25:15.0295473Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:25:15.0295953Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:25:15.0296408Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:25:15.0296847Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:25:15.0297287Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:25:15.0297724Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:25:15.0298146Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:25:15.0298615Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:25:15.0299087Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:25:15.0299521Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge
2025-05-07T20:25:15.0299982Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge
2025-05-07T20:25:15.0300455Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge
2025-05-07T20:25:15.0300861Z ------------------------------------------------------------
2025-05-07T20:25:15.0301276Z Total: 91.6 MB
2025-05-07T20:25:15.0301697Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:15.0302251Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:15.0302819Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:15.0303786Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:15.0304303Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:15.0304830Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:15.0305357Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:15.0305877Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:15.0306432Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:15.0306925Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:15.0307462Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:15.0307934Z The following packages will be UPDATED:
2025-05-07T20:25:15.0308460Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:15.0309421Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:15.0309987Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:15.4229337Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%
2025-05-07T20:25:15.7079841Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%
2025-05-07T20:25:15.7894104Z binutils_impl_linux- | 6.0 MB | ########## | 100%
2025-05-07T20:25:15.9086826Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%
2025-05-07T20:25:15.9091381Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:25:15.9186683Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:15.9640115Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%
2025-05-07T20:25:15.9697262Z binutils_linux-64-2. | 28 KB | ########## | 100%
2025-05-07T20:25:15.9714760Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%
2025-05-07T20:25:15.9816790Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:25:16.0092774Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%
2025-05-07T20:25:16.4983139Z gcc_impl_linux-64-11 | 53.0 MB | ########9 | 90%
2025-05-07T20:25:16.5497576Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:25:16.5497960Z 2025-05-07T20:25:16.5497964Z 2025-05-07T20:25:16.5497968Z 2025-05-07T20:25:16.5497971Z 2025-05-07T20:25:16.5497975Z 2025-05-07T20:25:16.5497978Z 2025-05-07T20:25:16.5497982Z 2025-05-07T20:25:16.5497986Z 2025-05-07T20:25:16.5497989Z 2025-05-07T20:25:16.5497993Z 2025-05-07T20:25:16.5498171Z 2025-05-07T20:25:16.6057924Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:25:16.6058226Z 2025-05-07T20:25:16.6058230Z 2025-05-07T20:25:16.6058233Z 2025-05-07T20:25:16.6058237Z 2025-05-07T20:25:16.6058241Z 2025-05-07T20:25:16.6058244Z 2025-05-07T20:25:16.6058248Z 2025-05-07T20:25:16.6058251Z 2025-05-07T20:25:16.6058255Z 2025-05-07T20:25:16.6058259Z 2025-05-07T20:25:16.6063060Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:25:16.6063568Z 2025-05-07T20:25:16.6063572Z 2025-05-07T20:25:16.6063576Z 2025-05-07T20:25:16.6063579Z 2025-05-07T20:25:16.6063583Z 2025-05-07T20:25:16.6063586Z 2025-05-07T20:25:16.6063590Z 2025-05-07T20:25:16.6063599Z 2025-05-07T20:25:16.6063603Z 2025-05-07T20:25:16.6064727Z 2025-05-07T20:25:16.6446329Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:25:16.6446602Z 2025-05-07T20:25:16.6446613Z 2025-05-07T20:25:16.6446617Z 2025-05-07T20:25:16.7139668Z binutils_impl_linux- | 6.0 MB | ########## | 100%  2025-05-07T20:25:16.7140066Z 2025-05-07T20:25:16.7140072Z 2025-05-07T20:25:16.7140087Z 2025-05-07T20:25:16.7140093Z 2025-05-07T20:25:16.7140098Z 2025-05-07T20:25:16.7140104Z 2025-05-07T20:25:16.7140108Z 2025-05-07T20:25:16.7140113Z 2025-05-07T20:25:16.7140118Z 2025-05-07T20:25:16.7145136Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%  2025-05-07T20:25:16.7145522Z 2025-05-07T20:25:16.7145550Z 2025-05-07T20:25:16.7145554Z 2025-05-07T20:25:16.7145558Z 2025-05-07T20:25:16.7145561Z 2025-05-07T20:25:16.7145565Z 2025-05-07T20:25:16.7145568Z 2025-05-07T20:25:16.7145572Z 2025-05-07T20:25:16.7145575Z 2025-05-07T20:25:16.8005320Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%  2025-05-07T20:25:16.8005911Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:16.9651531Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:16.9651867Z 2025-05-07T20:25:17.1508608Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%  2025-05-07T20:25:17.1508887Z 2025-05-07T20:25:17.1508891Z 2025-05-07T20:25:17.6033432Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%  2025-05-07T20:25:17.6040437Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:17.6041000Z 2025-05-07T20:25:17.6041338Z 2025-05-07T20:25:17.6041675Z  2025-05-07T20:25:17.6042028Z 2025-05-07T20:25:17.6042034Z 2025-05-07T20:25:17.6042328Z  2025-05-07T20:25:17.6042666Z 2025-05-07T20:25:17.6042672Z 2025-05-07T20:25:17.6042678Z 2025-05-07T20:25:17.6042955Z  2025-05-07T20:25:17.6043278Z 2025-05-07T20:25:17.6043283Z 2025-05-07T20:25:17.6043288Z 2025-05-07T20:25:17.6043293Z 2025-05-07T20:25:17.6043557Z  2025-05-07T20:25:17.6043868Z 2025-05-07T20:25:17.6043874Z 2025-05-07T20:25:17.6043879Z 2025-05-07T20:25:17.6043884Z 2025-05-07T20:25:17.6043889Z 2025-05-07T20:25:17.6044172Z  2025-05-07T20:25:17.6044486Z 2025-05-07T20:25:17.6044491Z 2025-05-07T20:25:17.6044496Z 2025-05-07T20:25:17.6044500Z 2025-05-07T20:25:17.6044505Z 2025-05-07T20:25:17.6044510Z 2025-05-07T20:25:17.6045063Z  2025-05-07T20:25:17.6045345Z 2025-05-07T20:25:17.6045348Z 2025-05-07T20:25:17.6045352Z 2025-05-07T20:25:17.6045356Z 2025-05-07T20:25:17.6045359Z 2025-05-07T20:25:17.6045363Z 
2025-05-07T20:25:17.6045366Z 2025-05-07T20:25:17.6045568Z  2025-05-07T20:25:17.6045864Z 2025-05-07T20:25:17.6045869Z 2025-05-07T20:25:17.6045874Z 2025-05-07T20:25:17.6045879Z 2025-05-07T20:25:17.6045884Z 2025-05-07T20:25:17.6045889Z 2025-05-07T20:25:17.6045894Z 2025-05-07T20:25:17.6045899Z 2025-05-07T20:25:17.6046172Z  2025-05-07T20:25:17.6046524Z 2025-05-07T20:25:17.6046530Z 2025-05-07T20:25:17.6046537Z 2025-05-07T20:25:17.6046543Z 2025-05-07T20:25:17.6046550Z 2025-05-07T20:25:17.6046556Z 2025-05-07T20:25:17.6046562Z 2025-05-07T20:25:17.6046568Z 2025-05-07T20:25:17.6046575Z 2025-05-07T20:25:17.6047105Z  2025-05-07T20:25:17.6047437Z 2025-05-07T20:25:17.6047443Z 2025-05-07T20:25:17.6047448Z 2025-05-07T20:25:17.6047453Z 2025-05-07T20:25:17.6047459Z 2025-05-07T20:25:17.6047464Z 2025-05-07T20:25:17.6047469Z 2025-05-07T20:25:17.6047474Z 2025-05-07T20:25:17.6047490Z 2025-05-07T20:25:17.6047495Z 2025-05-07T20:25:17.6047792Z  2025-05-07T20:25:17.6048137Z 2025-05-07T20:25:17.6048142Z 2025-05-07T20:25:17.6048147Z 2025-05-07T20:25:17.6048152Z 2025-05-07T20:25:17.6048157Z 2025-05-07T20:25:17.6048175Z 2025-05-07T20:25:17.6048180Z 2025-05-07T20:25:17.6048185Z 2025-05-07T20:25:17.6048190Z 2025-05-07T20:25:17.6048196Z 2025-05-07T20:25:17.6048201Z 2025-05-07T20:25:17.6048511Z  done 2025-05-07T20:25:17.7052655Z Preparing transaction: \ done 2025-05-07T20:25:18.0058068Z Verifying transaction: / - \ done 2025-05-07T20:25:18.1067997Z Executing transaction: / done 2025-05-07T20:25:18.2720261Z [INSTALL] Setting the C/C++ compiler symlinks ... 2025-05-07T20:25:22.1714407Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc 2025-05-07T20:25:22.1714955Z 2025-05-07T20:25:22.1726014Z 2025-05-07T20:25:22.1745659Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc 2025-05-07T20:25:22.1746200Z 2025-05-07T20:25:22.1758725Z 2025-05-07T20:25:22.1775967Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++ 2025-05-07T20:25:22.1776489Z 2025-05-07T20:25:22.1788032Z 2025-05-07T20:25:22.1805286Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++ 2025-05-07T20:25:22.1805854Z 2025-05-07T20:25:22.1818137Z 2025-05-07T20:25:24.0769534Z /home/ec2-user/miniconda/envs/build_binary/bin/cc 2025-05-07T20:25:24.0769815Z 2025-05-07T20:25:24.1388150Z [CHECK] Binary cc found in PATH 2025-05-07T20:25:26.0271372Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc 2025-05-07T20:25:26.0271669Z 2025-05-07T20:25:26.0897040Z [CHECK] Binary gcc found in PATH 2025-05-07T20:25:27.9790795Z /home/ec2-user/miniconda/envs/build_binary/bin/c++ 2025-05-07T20:25:27.9791076Z 2025-05-07T20:25:28.0420874Z [CHECK] Binary c++ found in PATH 2025-05-07T20:25:29.9289175Z /home/ec2-user/miniconda/envs/build_binary/bin/g++ 2025-05-07T20:25:29.9289475Z 2025-05-07T20:25:29.9913345Z [CHECK] Binary g++ found in PATH 2025-05-07T20:25:29.9917541Z [INFO] Printing out all preprocessor defines in the C compiler ... 
2025-05-07T20:25:29.9917977Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:29.9918185Z 2025-05-07T20:25:31.8880449Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:31.8881249Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:31.8881605Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:31.8881866Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:31.8882200Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:31.8882553Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:31.8882838Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:31.8883142Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:31.8883405Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:31.8883662Z #define __CHAR_BIT__ 8 2025-05-07T20:25:31.8883898Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:31.8884149Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:31.8884407Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:31.8884676Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:31.8884957Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:31.8885260Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8885562Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:31.8886034Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:31.8886364Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:31.8886685Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:31.8887084Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:31.8887533Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:31.8887845Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:31.8888123Z #define __GCC_IEC_559 2 2025-05-07T20:25:31.8888374Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:31.8888647Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:31.8888907Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:31.8889190Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:31.8889519Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8889835Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:31.8890112Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:31.8890389Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:31.8890661Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:31.8890932Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:31.8891196Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:31.8891490Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:31.8891765Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:31.8892022Z #define __INT8_C(c) c 2025-05-07T20:25:31.8892266Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:31.8892557Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8892880Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:31.8893194Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:31.8893541Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:31.8893819Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:31.8894089Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8894363Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:31.8894643Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:31.8895036Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:31.8895450Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:31.8895734Z #define __linux 1 2025-05-07T20:25:31.8895966Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:31.8896248Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:31.8896526Z #define __unix 1 2025-05-07T20:25:31.8896753Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:31.8897033Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:31.8897299Z #define __WINT_MIN__ 0U 2025-05-07T20:25:31.8897550Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:31.8897834Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:31.8898101Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:31.8898369Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:31.8898623Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:31.8898901Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:31.8899196Z #define __INT64_C(c) c ## L 2025-05-07T20:25:31.8899465Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:31.8899871Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:31.8900133Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:31.8900479Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:31.8900851Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:31.8901099Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:31.8901385Z #define __DBL_DIG__ 15 2025-05-07T20:25:31.8901647Z #define __FLT32_DIG__ 6 2025-05-07T20:25:31.8901940Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:31.8902288Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:31.8902540Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:31.8902860Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:31.8903206Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:31.8903458Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:31.8903716Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:31.8904095Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:31.8904573Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:31.8904853Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:31.8905103Z #define __unix__ 1 2025-05-07T20:25:31.8905331Z #define __INT_WIDTH__ 32 2025-05-07T20:25:31.8905576Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:31.8905816Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:31.8906072Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:31.8906340Z #define __UINT16_C(c) c 2025-05-07T20:25:31.8906576Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:31.8906838Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:31.8907193Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:31.8907548Z #define __gnu_linux__ 1 2025-05-07T20:25:31.8907794Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:31.8908073Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:31.8908356Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8908620Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:31.8908897Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:31.8909148Z #define __GNUC__ 11 2025-05-07T20:25:31.8909441Z #define __pie__ 2 2025-05-07T20:25:31.8909659Z #define __MMX__ 1 2025-05-07T20:25:31.8909878Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:31.8910147Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:31.8910431Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:31.8910698Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:31.8911042Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:31.8911484Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8911800Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:31.8912056Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:31.8912322Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:31.8912624Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:31.8912885Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:31.8913150Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:31.8913441Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:31.8913739Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:31.8914008Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:31.8914291Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:31.8914540Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:31.8914809Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:31.8915081Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:31.8915337Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:31.8915596Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:31.8915908Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:31.8916260Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:31.8916524Z #define __SSE2_MATH__ 1 2025-05-07T20:25:31.8916775Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:31.8917070Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8917380Z #define __amd64 1 2025-05-07T20:25:31.8917606Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:31.8917869Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:31.8918278Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:31.8918597Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:31.8918853Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:31.8919129Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:31.8919387Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:31.8919651Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:31.8919908Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:31.8920173Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:31.8920443Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:31.8920716Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:31.8921040Z #define __x86_64 1 2025-05-07T20:25:31.8929812Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:31.8930235Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:31.8930700Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:31.8931198Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:31.8931839Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:31.8932228Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:31.8932487Z #define __LP64__ 1 2025-05-07T20:25:31.8932713Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8933062Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:31.8933439Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:31.8933717Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:31.8933992Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:31.8934274Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:31.8934555Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:31.8934817Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:31.8935079Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:31.8935340Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:31.8935595Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:31.8935925Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:31.8936291Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:31.8936562Z #define __FLT_DIG__ 6 2025-05-07T20:25:31.8936798Z #define __NO_INLINE__ 1 2025-05-07T20:25:31.8937043Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:31.8937359Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:31.8937708Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:31.8937967Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:31.8938229Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:31.8938477Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:31.8938733Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:31.8938988Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:31.8939273Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:31.8939559Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:31.8939824Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:31.8940118Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:31.8940447Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:31.8940720Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:31.8941008Z #define __FLT128_DIG__ 33 2025-05-07T20:25:31.8941258Z #define __INT32_C(c) c 2025-05-07T20:25:31.8941498Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:31.8941766Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:31.8942042Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:31.8942319Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:31.8942629Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:31.8942925Z #define unix 1 2025-05-07T20:25:31.8943156Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:31.8943469Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8943766Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:31.8944072Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:31.8944398Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:31.8944641Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:31.8944902Z #define __ELF__ 1 2025-05-07T20:25:31.8945277Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:31.8945561Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:31.8945837Z #define __FLT_RADIX__ 2 2025-05-07T20:25:31.8946087Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:31.8946439Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:31.8946802Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:31.8947061Z #define __SSE_MATH__ 1 2025-05-07T20:25:31.8947291Z #define __k8 1 2025-05-07T20:25:31.8947579Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:31.8947950Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:31.8948246Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:31.8948539Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:31.8948799Z #define __LDBL_DIG__ 18 2025-05-07T20:25:31.8949047Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:31.8949349Z #define __x86_64__ 1 2025-05-07T20:25:31.8949589Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:31.8949894Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:31.8950331Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8950637Z #define __FLT64_DIG__ 15 2025-05-07T20:25:31.8950918Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8951262Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:31.8951573Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8951888Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:31.8952163Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8952449Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:31.8952808Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:31.8953201Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:31.8953485Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:31.8953819Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:31.8954139Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:31.8954426Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:31.8954716Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:31.8955023Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:31.8955300Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:31.8955531Z #define __SEG_FS 1 2025-05-07T20:25:31.8955762Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:31.8956037Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:31.8956302Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8956589Z #define __SEG_GS 1 2025-05-07T20:25:31.8956900Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:31.8957272Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:31.8957543Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:31.8957827Z #define __INT16_TYPE__ short int 2025-05-07T20:25:31.8958098Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:31.8958390Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:31.8958653Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:31.8958893Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:31.8959168Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:31.8959509Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:31.8959899Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8960184Z #define linux 1 2025-05-07T20:25:31.8960414Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8960699Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:31.8960972Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:31.8961230Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:31.8961491Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:31.8961749Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:31.8962094Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:31.8962504Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:31.8962826Z #define __code_model_small__ 1 2025-05-07T20:25:31.8963100Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:31.8963387Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:31.8963732Z #define __k8__ 1 2025-05-07T20:25:31.8963960Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:31.8964252Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:31.8964550Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:31.8964787Z #define __pic__ 2 2025-05-07T20:25:31.8965039Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8965349Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:31.8965630Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8965958Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:31.8966325Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:31.8966676Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:31.8966947Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:31.8967239Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:31.8967543Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:31.8967800Z #define __linux__ 1 2025-05-07T20:25:31.8968028Z #define __INT64_TYPE__ long int 2025-05-07T20:25:31.8968382Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:31.8968637Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:31.8968909Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:31.8969163Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:31.8969445Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8969768Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:31.8970062Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:31.8970320Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:31.8970613Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:31.8970921Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:31.8971276Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:31.8971628Z #define __SSE__ 1 2025-05-07T20:25:31.8971863Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:31.8972201Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:31.8972537Z #define __amd64__ 1 2025-05-07T20:25:31.8972763Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:31.8973027Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:31.8973290Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:31.8973564Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:31.8973830Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:31.8974096Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:31.8974356Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:31.8974628Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:31.8974891Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:31.8975244Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:31.8975703Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:31.8976052Z #define _LP64 1 2025-05-07T20:25:31.8976279Z #define __UINT8_C(c) c 2025-05-07T20:25:31.8976521Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:31.8976785Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:31.8977048Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:31.8977333Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:31.8977640Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:31.8977996Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:31.8978446Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:31.8978817Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8979111Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8979415Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:31.8979777Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:31.8980144Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:31.8980399Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:31.8980733Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:31.8981095Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:31.8981352Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:31.8981621Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:31.8981897Z #define __FXSR__ 1 2025-05-07T20:25:31.8982295Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:31.8982743Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:31.8983148Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:31.8983456Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:31.8983709Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:31.8984039Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:31.8984392Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:31.8984630Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:31.8984865Z #define __PIC__ 2 2025-05-07T20:25:31.8985112Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:31.8985503Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:31.8985877Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:31.8986206Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:31.8986613Z #define __SSE2__ 1 2025-05-07T20:25:31.8986827Z #define __INT32_TYPE__ int 2025-05-07T20:25:31.8987081Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:31.8987337Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:31.8987659Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:31.8988010Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:31.8988277Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:31.8988539Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:31.8988805Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8989082Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:31.8989371Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:31.8989615Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:31.8989898Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8990189Z #define __PIE__ 2 2025-05-07T20:25:31.8990502Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:31.8990886Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:31.8991237Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:31.8991589Z #define __INT16_C(c) c 2025-05-07T20:25:31.8991819Z #define __STDC__ 1 2025-05-07T20:25:31.8992049Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:31.8992313Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:31.8992565Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:31.8992863Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:31.8993200Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:31.8993531Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:31.8993796Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:31.8994076Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:31.8994335Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:31.8994617Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:31.8994906Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8995175Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:31.8995473Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8995866Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:31.8996228Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:31.8996533Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:31.8996826Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:31.8997070Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:31.8997232Z 2025-05-07T20:25:31.9504912Z 2025-05-07T20:25:31.9505309Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
2025-05-07T20:25:31.9505747Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:31.9505974Z 2025-05-07T20:25:33.8442486Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:33.8442858Z #define __cpp_attributes 200809L 2025-05-07T20:25:33.8443192Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:33.8443544Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:33.8443831Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:33.8444087Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:33.8444773Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:33.8445128Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:33.8445406Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:33.8445726Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:33.8446042Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:33.8446315Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:33.8446567Z #define __CHAR_BIT__ 8 2025-05-07T20:25:33.8446806Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:33.8447053Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:33.8447301Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:33.8447570Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:33.8447846Z #define __cpp_static_assert 201411L 2025-05-07T20:25:33.8448127Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:33.8448426Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:33.8448726Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:33.8449010Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:33.8449494Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:33.8449820Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:33.8450216Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:33.8450618Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:33.8450927Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:33.8451210Z #define __GCC_IEC_559 2 2025-05-07T20:25:33.8451452Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:33.8451775Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:33.8452047Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:33.8452328Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:33.8452619Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:33.8452934Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:33.8453236Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:33.8453567Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:33.8453899Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:33.8454164Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:33.8454438Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:33.8454717Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:33.8455014Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:33.8455271Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:33.8455534Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:33.8455811Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:33.8456133Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:33.8456460Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:33.8456716Z #define __INT8_C(c) c 2025-05-07T20:25:33.8456948Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:33.8457222Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:33.8457541Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:33.8457856Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:33.8458138Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:33.8458441Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:33.8458757Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:33.8459104Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:33.8459387Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:33.8459667Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:33.8459926Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:33.8460200Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:33.8460475Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:33.8460858Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:33.8461275Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:33.8461573Z #define __linux 1 2025-05-07T20:25:33.8461836Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:33.8462113Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:33.8462390Z #define __unix 1 2025-05-07T20:25:33.8462610Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:33.8462998Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:33.8463289Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:33.8463560Z #define __WINT_MIN__ 0U 2025-05-07T20:25:33.8463801Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:33.8464080Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:33.8464353Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:33.8464613Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:33.8464865Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:33.8465149Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:33.8465440Z #define __INT64_C(c) c ## L 2025-05-07T20:25:33.8465704Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:33.8465999Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:33.8466263Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:33.8466560Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:33.8466833Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:33.8467088Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:33.8467438Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:33.8467901Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:33.8468152Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:33.8468420Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:33.8468692Z #define __DBL_DIG__ 15 2025-05-07T20:25:33.8468924Z #define __FLT32_DIG__ 6 2025-05-07T20:25:33.8469217Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:33.8469654Z #define __GXX_WEAK__ 1 2025-05-07T20:25:33.8469891Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:33.8470135Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:33.8470458Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:33.8470808Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:33.8471073Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:33.8471365Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:33.8471694Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:33.8472150Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:33.8472545Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:33.8472821Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:33.8473078Z #define __unix__ 1 2025-05-07T20:25:33.8473297Z #define __INT_WIDTH__ 32 2025-05-07T20:25:33.8473543Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:33.8473788Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:33.8474035Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:33.8474301Z #define __UINT16_C(c) c 2025-05-07T20:25:33.8474541Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:33.8474791Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:33.8475144Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:33.8475503Z #define __gnu_linux__ 1 2025-05-07T20:25:33.8475818Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:33.8476077Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:33.8476353Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:33.8476638Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:33.8476911Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:33.8477171Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:33.8477427Z #define __GNUC__ 11 2025-05-07T20:25:33.8477639Z #define __GXX_RTTI 1 2025-05-07T20:25:33.8477861Z #define __pie__ 2 2025-05-07T20:25:33.8478073Z #define __MMX__ 1 2025-05-07T20:25:33.8478307Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:33.8478573Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:33.8478855Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:33.8479119Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:33.8479369Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:33.8479664Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:33.8479972Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:33.8480316Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:33.8480685Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:33.8480982Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:33.8481403Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:33.8481672Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:33.8481927Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:33.8482233Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:33.8482525Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:33.8482791Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:33.8483042Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:33.8483328Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:33.8492170Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:33.8492469Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:33.8492762Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:33.8493015Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:33.8493284Z #define __cplusplus 201703L 2025-05-07T20:25:33.8493554Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:33.8493839Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:33.8494098Z #define __DEPRECATED 1 2025-05-07T20:25:33.8494360Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:33.8494820Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:33.8495087Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:33.8495405Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:33.8495759Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:33.8496039Z #define __SSE2_MATH__ 1 2025-05-07T20:25:33.8496293Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:33.8496600Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:33.8496887Z #define __amd64 1 2025-05-07T20:25:33.8497121Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:33.8497392Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:33.8497655Z #define __GNUG__ 11 2025-05-07T20:25:33.8497917Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:33.8498232Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:33.8498481Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:33.8498745Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:33.8499022Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:33.8499282Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:33.8499561Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:33.8499857Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:33.8500117Z #define __cpp_hex_float 201603L 2025-05-07T20:25:33.8500386Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:33.8500655Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:33.8500932Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:33.8501197Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:33.8501466Z #define __x86_64 1 2025-05-07T20:25:33.8501695Z #define __cpp_lambdas 200907L 2025-05-07T20:25:33.8501984Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:33.8502378Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:33.8502765Z #define __cpp_template_auto 201606L 2025-05-07T20:25:33.8503113Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:33.8503559Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:33.8504037Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:33.8504423Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:33.8504668Z #define __LP64__ 1 2025-05-07T20:25:33.8504896Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:33.8505245Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:33.8505614Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:33.8505888Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:33.8506173Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:33.8506440Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:33.8506710Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:33.8506970Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:33.8507225Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:33.8507552Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:33.8507912Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:33.8508180Z #define __FLT_DIG__ 6 2025-05-07T20:25:33.8508583Z #define __NO_INLINE__ 1 2025-05-07T20:25:33.8508831Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:33.8509158Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:33.8509588Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:33.8509847Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:33.8510111Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:33.8510360Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:33.8510634Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:33.8510936Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:33.8511188Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:33.8511489Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:33.8511821Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:33.8512088Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:33.8512389Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:33.8512728Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:33.8513007Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:33.8513412Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:33.8513671Z #define __FLT128_DIG__ 33 2025-05-07T20:25:33.8513911Z #define __INT32_C(c) c 2025-05-07T20:25:33.8514146Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:33.8514425Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:33.8514702Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:33.8514973Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:33.8515289Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:33.8515595Z #define unix 1 2025-05-07T20:25:33.8515813Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:33.8516080Z #define __cpp_rtti 199711L 2025-05-07T20:25:33.8516344Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:33.8516650Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:33.8516952Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:33.8517262Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:33.8517582Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:33.8517841Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:33.8518135Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:33.8518415Z #define __ELF__ 1 2025-05-07T20:25:33.8518639Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:33.8518920Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:33.8519195Z #define __FLT_RADIX__ 2 2025-05-07T20:25:33.8519435Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:33.8519791Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:33.8520155Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:33.8520424Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:33.8520701Z #define __k8 1 2025-05-07T20:25:33.8520997Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:33.8521363Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:33.8521707Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:33.8522011Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:33.8522271Z #define __LDBL_DIG__ 18 2025-05-07T20:25:33.8522519Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:33.8522780Z #define __x86_64__ 1 2025-05-07T20:25:33.8523020Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:33.8523311Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:33.8523645Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:33.8523954Z #define __FLT64_DIG__ 15 2025-05-07T20:25:33.8524229Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:33.8524576Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:33.8524894Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:33.8525153Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:33.8525432Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:33.8525736Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:33.8526092Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:33.8526488Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:33.8526779Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:33.8527207Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:33.8527515Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:33.8527834Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:33.8528582Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:33.8528924Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:33.8529230Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:33.8529508Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:33.8529740Z #define __SEG_FS 1 2025-05-07T20:25:33.8529972Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:33.8530249Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:33.8530516Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:33.8530804Z #define __SEG_GS 1 2025-05-07T20:25:33.8531114Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:33.8531494Z [... tail of the `c++ -dM -E` predefined-macro dump elided: roughly 160 further #define lines covering type sizes and widths (__SIZEOF_INT__ 4, __INT_MAX__ 0x7fffffff, __LONG_WIDTH__ 64), floating-point limits (__DBL_MANT_DIG__ 53, __LDBL_MAX_EXP__ 16384, __FLT128_MAX__), platform and ABI macros (__linux__ 1, __amd64__ 1, __SSE2__ 1, _LP64 1, _GNU_SOURCE 1, __pic__ 2), and C++17 feature-test macros (__cpp_structured_bindings 201606L, __cpp_deduction_guides 201703L, __cpp_constexpr 201603L) ...]
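Everything in the dump above comes from feeding an empty translation unit to the compiler with -dM -E, which makes the preprocessor print every macro it predefines for the selected dialect. A minimal sketch of the same technique for spot-checking a single macro (the build_binary env name is taken from this log; the grep targets are illustrative, not commands the workflow runs):

  # Dump all predefined macros for the default C++ dialect, sorted for easier scanning
  conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | sort

  # Spot-check a single feature-test macro, e.g. structured bindings (C++17)
  conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cpp_structured_bindings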
2025-05-07T20:25:33.9074247Z + conda run -n build_binary c++ --version
2025-05-07T20:25:35.7941057Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:35.7941457Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:35.7941904Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:35.7942439Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:35.8566076Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:35.8566649Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:37.8133402Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:37.8136392Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:37.8136967Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:39.7761298Z #define __cplusplus 201703L
2025-05-07T20:25:39.7764393Z [INSTALL] Successfully installed C/C++ compilers
2025-05-07T20:25:39.7799075Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:39.7799492Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:39.7811958Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:39.7812303Z env:
2025-05-07T20:25:39.7812531Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:39.7812834Z   BUILD_ENV: build_binary
2025-05-07T20:25:39.7813069Z   BUILD_TARGET: genai
2025-05-07T20:25:39.7813306Z   BUILD_VARIANT: cuda
2025-05-07T20:25:39.7813540Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:39.7813789Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:39.7814265Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:39.7814600Z ##[endgroup]
2025-05-07T20:25:40.1193475Z ################################################################################
2025-05-07T20:25:40.1193838Z # Install CUDA
2025-05-07T20:25:40.1194068Z #
2025-05-07T20:25:40.1208938Z # [2025-05-07T20:25:40.120Z] + install_cuda build_binary 12.8.0
2025-05-07T20:25:40.1209312Z ################################################################################
2025-05-07T20:25:40.1224336Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:40.2096258Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:40.2096624Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:40.2101851Z + conda clean --packages --tarball -y
2025-05-07T20:25:40.9219369Z Will remove 32 (148.9 MB) tarball(s).
2025-05-07T20:25:40.9219789Z Will remove 6 (619 KB) package(s).
2025-05-07T20:25:40.9850422Z + conda clean --all -y
2025-05-07T20:25:41.6629937Z There are no unused tarball(s) to remove.
2025-05-07T20:25:41.6630620Z Will remove 1 index cache(s).
2025-05-07T20:25:41.6631176Z There are no unused package(s) to remove.
2025-05-07T20:25:41.6631789Z There are no tempfile(s) to remove.
2025-05-07T20:25:41.6632397Z There are no logfile(s) to remove.
2025-05-07T20:25:41.7277878Z [INSTALL] Installing CUDA 12.8.0 ...
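The [EXEC] [ATTEMPT 0/3] prefix indicates that setup_env.bash wraps network-dependent commands in a bounded retry loop; the helper's actual implementation is not shown in this log. A minimal sketch of that pattern, with the function name, attempt count, and delay all assumed for illustration:

  # Hypothetical retry wrapper: run a command up to 4 times (attempts 0..3)
  exec_with_retries () {
    local max_retries=3
    local attempt
    for attempt in $(seq 0 "${max_retries}"); do
      echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
      "$@" && return 0   # stop on the first success
      sleep 5            # brief pause before retrying
    done
    return 1             # all attempts exhausted
  }

  exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null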
2025-05-07T20:25:41.7302450Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0
2025-05-07T20:25:42.6378954Z Channels:
2025-05-07T20:25:42.6379251Z  - conda-forge
2025-05-07T20:25:42.6379494Z Platform: linux-64
2025-05-07T20:25:53.2252628Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:54.3523869Z Solving environment: done
2025-05-07T20:25:54.4275993Z ## Package Plan ##
2025-05-07T20:25:54.4276430Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:54.4276911Z   added / updated specs:
2025-05-07T20:25:54.4277173Z     - cuda=12.8.0
2025-05-07T20:25:54.4277471Z The following packages will be downloaded:
2025-05-07T20:25:54.4277831Z     package | build
2025-05-07T20:25:54.4278148Z     ---------------------------|-----------------
2025-05-07T20:25:54.4278527Z     alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge
2025-05-07T20:25:54.4278940Z     attr-2.5.1 | h166bdaf_1 69 KB conda-forge
2025-05-07T20:25:54.4279398Z     binutils-2.40 | h4852527_7 31 KB conda-forge
2025-05-07T20:25:54.4279976Z     c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge
2025-05-07T20:25:54.4280409Z     cuda-12.8.0 | ha804496_0 26 KB conda-forge
2025-05-07T20:25:54.4280837Z     cuda-cccl_linux-64-12.8.55 | ha770c72_1 1.0 MB conda-forge
2025-05-07T20:25:54.4281806Z     cuda-command-line-tools-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:54.4282320Z     cuda-compiler-12.8.0 | hbad6d8a_0 20 KB conda-forge
2025-05-07T20:25:54.4282842Z     cuda-crt-dev_linux-64-12.8.61 | ha770c72_1 90 KB conda-forge
2025-05-07T20:25:54.4283512Z     cuda-crt-tools-12.8.61 | ha770c72_1 27 KB conda-forge
2025-05-07T20:25:54.4284127Z     cuda-cudart-12.8.57 | h5888daf_1 22 KB conda-forge
2025-05-07T20:25:54.4284599Z     cuda-cudart-dev-12.8.57 | h5888daf_1 23 KB conda-forge
2025-05-07T20:25:54.4285301Z     cuda-cudart-dev_linux-64-12.8.57 | h3f2d84a_1 377 KB conda-forge
2025-05-07T20:25:54.4286170Z     cuda-cudart-static-12.8.57 | h5888daf_1 22 KB conda-forge
2025-05-07T20:25:54.4286683Z     cuda-cudart-static_linux-64-12.8.57 | h3f2d84a_1 950 KB conda-forge
2025-05-07T20:25:54.4287192Z     cuda-cudart_linux-64-12.8.57 | h3f2d84a_1 188 KB conda-forge
2025-05-07T20:25:54.4287672Z     cuda-cuobjdump-12.8.55 | hbd13f7d_0 227 KB conda-forge
2025-05-07T20:25:54.4288111Z     cuda-cupti-12.8.57 | hbd13f7d_0 1.8 MB conda-forge
2025-05-07T20:25:54.4288550Z     cuda-cupti-dev-12.8.57 | h5888daf_0 4.0 MB conda-forge
2025-05-07T20:25:54.4289057Z     cuda-cuxxfilt-12.8.55 | hbd13f7d_0 211 KB conda-forge
2025-05-07T20:25:54.4289506Z     cuda-driver-dev-12.8.57 | h5888daf_1 22 KB conda-forge
2025-05-07T20:25:54.4289990Z     cuda-driver-dev_linux-64-12.8.90 | h3f2d84a_1 36 KB conda-forge
2025-05-07T20:25:54.4290456Z     cuda-gdb-12.8.55 | h50b4baa_0 353 KB conda-forge
2025-05-07T20:25:54.4290890Z     cuda-libraries-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:54.4291356Z     cuda-libraries-dev-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:54.4291821Z     cuda-nsight-12.8.55 | h7938cbb_0 113.2 MB conda-forge
2025-05-07T20:25:54.4292253Z     cuda-nvcc-12.8.61 | hcdd1206_0 23 KB conda-forge
2025-05-07T20:25:54.4292703Z     cuda-nvcc-dev_linux-64-12.8.61 | he91c749_1 12.7 MB conda-forge
2025-05-07T20:25:54.4293173Z     cuda-nvcc-impl-12.8.61 | h85509e4_1 25 KB conda-forge
2025-05-07T20:25:54.4293627Z     cuda-nvcc-tools-12.8.61 | he02047a_1 24.5 MB conda-forge
2025-05-07T20:25:54.4294086Z     cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge
2025-05-07T20:25:54.4294539Z     cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge
2025-05-07T20:25:54.4294997Z     cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge
2025-05-07T20:25:54.4295437Z     cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge
2025-05-07T20:25:54.4295881Z     cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge
2025-05-07T20:25:54.4296323Z     cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge
2025-05-07T20:25:54.4296766Z     cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge
2025-05-07T20:25:54.4297201Z     cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge
2025-05-07T20:25:54.4297650Z     cuda-nvvm-dev_linux-64-12.8.61 | ha770c72_1 25 KB conda-forge
2025-05-07T20:25:54.4298121Z     cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge
2025-05-07T20:25:54.4298576Z     cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge
2025-05-07T20:25:54.4299020Z     cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge
2025-05-07T20:25:54.4299492Z     cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge
2025-05-07T20:25:54.4299940Z     cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge
2025-05-07T20:25:54.4300529Z     cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge
2025-05-07T20:25:54.4300990Z     cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge
2025-05-07T20:25:54.4301455Z     cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge
2025-05-07T20:25:54.4301921Z     cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge
2025-05-07T20:25:54.4302352Z     cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge
2025-05-07T20:25:54.4302781Z     cuda-version-12.8 | h5d125a7_3 21 KB conda-forge
2025-05-07T20:25:54.4303238Z     cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:54.4303781Z     cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge
2025-05-07T20:25:54.4304184Z     dbus-1.13.6 | h5008d03_3 604 KB conda-forge
2025-05-07T20:25:54.4304566Z     expat-2.7.0 | h5888daf_0 137 KB conda-forge
2025-05-07T20:25:54.4305030Z     font-ttf-dejavu-sans-mono-2.37 | hab24e00_0 388 KB conda-forge
2025-05-07T20:25:54.4305543Z     font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge
2025-05-07T20:25:54.4306044Z     font-ttf-source-code-pro-2.038 | h77eed37_0 684 KB conda-forge
2025-05-07T20:25:54.4306530Z     font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge
2025-05-07T20:25:54.4306968Z     fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge
2025-05-07T20:25:54.4307423Z     fonts-conda-ecosystem-1 | 0 4 KB conda-forge
2025-05-07T20:25:54.4307883Z     fonts-conda-forge-1 | 0 4 KB conda-forge
2025-05-07T20:25:54.4308322Z     freetype-2.13.3 | ha770c72_1 168 KB conda-forge
2025-05-07T20:25:54.4308716Z     gcc-11.4.0 | h602e360_13 49 KB conda-forge
2025-05-07T20:25:54.4309200Z     gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge
2025-05-07T20:25:54.4309603Z     gmp-6.3.0 | hac33072_2 449 KB conda-forge
2025-05-07T20:25:54.4309978Z     gxx-11.4.0 | h602e360_13 49 KB conda-forge
2025-05-07T20:25:54.4310370Z     keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge
2025-05-07T20:25:54.4310760Z     krb5-1.21.3 | h659f571_0 1.3 MB conda-forge
2025-05-07T20:25:54.4311145Z     libcap-2.71 | h39aace5_0 100 KB conda-forge
2025-05-07T20:25:54.4311559Z     libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge
2025-05-07T20:25:54.4312002Z     libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge
2025-05-07T20:25:54.4312448Z     libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge
2025-05-07T20:25:54.4312883Z     libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge
2025-05-07T20:25:54.4313328Z     libcufile-1.13.0.11 | h12f29b5_0 939 KB conda-forge
2025-05-07T20:25:54.4313765Z     libcufile-dev-1.13.0.11 | h5888daf_0 35 KB conda-forge
2025-05-07T20:25:54.4314209Z     libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge
2025-05-07T20:25:54.4314657Z     libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge
2025-05-07T20:25:54.4315100Z     libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge
2025-05-07T20:25:54.4315563Z     libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge
2025-05-07T20:25:54.4316024Z     libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge
2025-05-07T20:25:54.4316493Z     libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge
2025-05-07T20:25:54.4316959Z     libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge
2025-05-07T20:25:54.4317395Z     libexpat-2.7.0 | h5888daf_0 73 KB conda-forge
2025-05-07T20:25:54.4317918Z     libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge
2025-05-07T20:25:54.4318366Z     libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge
2025-05-07T20:25:54.4318840Z     libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge
2025-05-07T20:25:54.4319298Z     libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge
2025-05-07T20:25:54.4319708Z     libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge
2025-05-07T20:25:54.4320133Z     libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge
2025-05-07T20:25:54.4320636Z     libiconv-1.18 | h4ce23a2_1 696 KB conda-forge
2025-05-07T20:25:54.4321035Z     libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge
2025-05-07T20:25:54.4321439Z     libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge
2025-05-07T20:25:54.4321862Z     libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge
2025-05-07T20:25:54.4322281Z     libnsl-2.0.1 | hd590300_0 33 KB conda-forge
2025-05-07T20:25:54.4322725Z     libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge
2025-05-07T20:25:54.4323149Z     libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge
2025-05-07T20:25:54.4323600Z     libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge
2025-05-07T20:25:54.4324073Z     libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge
2025-05-07T20:25:54.4324537Z     libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge
2025-05-07T20:25:54.4324999Z     libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge
2025-05-07T20:25:54.4325436Z     libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge
2025-05-07T20:25:54.4325877Z     libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge
2025-05-07T20:25:54.4326291Z     libpng-1.6.47 | h943b412_0 282 KB conda-forge
2025-05-07T20:25:54.4326698Z     libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge
2025-05-07T20:25:54.4327131Z     libsystemd0-256.9 | h2774228_0 401 KB conda-forge
2025-05-07T20:25:54.4327559Z     libudev1-257.4 | h9a4d06a_0 140 KB conda-forge
2025-05-07T20:25:54.4327972Z     libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge
2025-05-07T20:25:54.4328765Z     libxcb-1.17.0 | h8a09558_0 387 KB conda-forge
2025-05-07T20:25:54.4329249Z     libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge
2025-05-07T20:25:54.4329691Z     libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge
2025-05-07T20:25:54.4330104Z     libxml2-2.13.5 | h064dc61_0 673 KB conda-forge
2025-05-07T20:25:54.4330511Z     libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge
2025-05-07T20:25:54.4330906Z     lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge
2025-05-07T20:25:54.4331345Z     nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge
2025-05-07T20:25:54.4331781Z     nspr-4.36 | h5888daf_0 225 KB conda-forge
2025-05-07T20:25:54.4332163Z     nss-3.111 | h159eef7_0 1.9 MB conda-forge
2025-05-07T20:25:54.4332549Z     ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge
2025-05-07T20:25:54.4332987Z     opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge
2025-05-07T20:25:54.4333419Z     pcre2-10.44 | hc749103_2 934 KB conda-forge
2025-05-07T20:25:54.4333841Z     pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge
2025-05-07T20:25:54.4334280Z     python-3.11.8 | hab00c5b_0_cpython 29.3 MB conda-forge
2025-05-07T20:25:54.4334847Z     rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge
2025-05-07T20:25:54.4335256Z     sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge
2025-05-07T20:25:54.4335651Z     tk-8.6.13 | noxft_h4845f30_101 3.2 MB conda-forge
2025-05-07T20:25:54.4336046Z     wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge
2025-05-07T20:25:54.4336439Z     xcb-util-0.4.1 | hb711507_2 19 KB conda-forge
2025-05-07T20:25:54.4336865Z     xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge
2025-05-07T20:25:54.4337313Z     xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge
2025-05-07T20:25:54.4337901Z     xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge
2025-05-07T20:25:54.4338374Z     xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge
2025-05-07T20:25:54.4338827Z     xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge
2025-05-07T20:25:54.4339280Z     xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge
2025-05-07T20:25:54.4339729Z     xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge
2025-05-07T20:25:54.4340152Z     xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge
2025-05-07T20:25:54.4340576Z     xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge
2025-05-07T20:25:54.4340995Z     xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge
2025-05-07T20:25:54.4341456Z     xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge
2025-05-07T20:25:54.4341939Z     xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge
2025-05-07T20:25:54.4342394Z     xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge
2025-05-07T20:25:54.4342830Z     xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge
2025-05-07T20:25:54.4343282Z     xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge
2025-05-07T20:25:54.4343722Z     xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge
2025-05-07T20:25:54.4344162Z     xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge
2025-05-07T20:25:54.4344616Z     xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge
2025-05-07T20:25:54.4345070Z     xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge
2025-05-07T20:25:54.4345482Z     zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge
2025-05-07T20:25:54.4345859Z     zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge
2025-05-07T20:25:54.4346249Z     ------------------------------------------------------------
2025-05-07T20:25:54.4346593Z     Total: 1.90 GB
2025-05-07T20:25:54.4346940Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:54.4347371Z   alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0
2025-05-07T20:25:54.4347801Z   attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1
2025-05-07T20:25:54.4348216Z   binutils conda-forge/linux-64::binutils-2.40-h4852527_7
2025-05-07T20:25:54.4348675Z   c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0
2025-05-07T20:25:54.4349190Z   cuda conda-forge/noarch::cuda-12.8.0-ha804496_0
2025-05-07T20:25:54.4349693Z   cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.8.55-ha770c72_1
2025-05-07T20:25:54.4350287Z   cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.8.0-ha770c72_0
2025-05-07T20:25:54.4350857Z   cuda-compiler conda-forge/noarch::cuda-compiler-12.8.0-hbad6d8a_0
2025-05-07T20:25:54.4351394Z   cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.8.61-ha770c72_1
2025-05-07T20:25:54.4351953Z   cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.8.61-ha770c72_1
2025-05-07T20:25:54.4352560Z   cuda-cudart conda-forge/linux-64::cuda-cudart-12.8.57-h5888daf_1
2025-05-07T20:25:54.4353072Z   cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.8.57-h5888daf_1
2025-05-07T20:25:54.4353639Z   cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.8.57-h3f2d84a_1
2025-05-07T20:25:54.4356056Z   cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.8.57-h5888daf_1
2025-05-07T20:25:54.4356676Z   cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.8.57-h3f2d84a_1
2025-05-07T20:25:54.4357267Z   cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.8.57-h3f2d84a_1
2025-05-07T20:25:54.4357936Z   cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4358446Z   cuda-cupti conda-forge/linux-64::cuda-cupti-12.8.57-hbd13f7d_0
2025-05-07T20:25:54.4358946Z   cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.8.57-h5888daf_0
2025-05-07T20:25:54.4359471Z   cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4360055Z   cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.8.57-h5888daf_1
2025-05-07T20:25:54.4360625Z   cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.8.90-h3f2d84a_1
2025-05-07T20:25:54.4361145Z   cuda-gdb conda-forge/linux-64::cuda-gdb-12.8.55-h50b4baa_0
2025-05-07T20:25:54.4361624Z   cuda-libraries conda-forge/linux-64::cuda-libraries-12.8.0-ha770c72_0
2025-05-07T20:25:54.4362181Z   cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.8.0-ha770c72_0
2025-05-07T20:25:54.4362724Z   cuda-nsight conda-forge/linux-64::cuda-nsight-12.8.55-h7938cbb_0
2025-05-07T20:25:54.4363199Z   cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.8.61-hcdd1206_0
2025-05-07T20:25:54.4363706Z   cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.8.61-he91c749_1
2025-05-07T20:25:54.4364264Z   cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.8.61-h85509e4_1
2025-05-07T20:25:54.4364799Z   cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.8.61-he02047a_1
2025-05-07T20:25:54.4365348Z   cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.8.61-h04802cd_0
2025-05-07T20:25:54.4365875Z   cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4366391Z   cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4366890Z   cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.8.57-hbd13f7d_0
2025-05-07T20:25:54.4367388Z   cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4367879Z   cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.8.61-hbd13f7d_0
2025-05-07T20:25:54.4368381Z   cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.8.61-h5888daf_0
2025-05-07T20:25:54.4369057Z   cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4369603Z   cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.8.61-ha770c72_1
2025-05-07T20:25:54.4370154Z   cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.8.61-he02047a_1
2025-05-07T20:25:54.4370691Z   cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.8.61-he02047a_1
2025-05-07T20:25:54.4371192Z   cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.8.57-hbd13f7d_0
2025-05-07T20:25:54.4371658Z   cuda-opencl conda-forge/linux-64::cuda-opencl-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4372177Z   cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.8.55-h5888daf_0
2025-05-07T20:25:54.4372740Z   cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.8.55-h7938cbb_0
2025-05-07T20:25:54.4373281Z   cuda-runtime conda-forge/noarch::cuda-runtime-12.8.0-ha804496_0
2025-05-07T20:25:54.4373816Z   cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4374359Z   cuda-toolkit conda-forge/noarch::cuda-toolkit-12.8.0-ha804496_0
2025-05-07T20:25:54.4374943Z   cuda-tools conda-forge/linux-64::cuda-tools-12.8.0-ha770c72_0
2025-05-07T20:25:54.4375415Z   cuda-version conda-forge/noarch::cuda-version-12.8-h5d125a7_3
2025-05-07T20:25:54.4375931Z   cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.8.0-ha770c72_0
2025-05-07T20:25:54.4376469Z   cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0
2025-05-07T20:25:54.4376918Z   dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3
2025-05-07T20:25:54.4377321Z   expat conda-forge/linux-64::expat-2.7.0-h5888daf_0
2025-05-07T20:25:54.4377829Z   font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
2025-05-07T20:25:54.4378515Z   font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0
2025-05-07T20:25:54.4379160Z   font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0
2025-05-07T20:25:54.4379729Z   font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3
2025-05-07T20:25:54.4380226Z   fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1
2025-05-07T20:25:54.4380718Z   fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0
2025-05-07T20:25:54.4381202Z   fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0
2025-05-07T20:25:54.4381652Z   freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1
2025-05-07T20:25:54.4382069Z   gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13
2025-05-07T20:25:54.4382495Z   gds-tools conda-forge/linux-64::gds-tools-1.13.0.11-h5888daf_0
2025-05-07T20:25:54.4382918Z   gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2
2025-05-07T20:25:54.4383295Z   gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13
2025-05-07T20:25:54.4383703Z   keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0
2025-05-07T20:25:54.4384117Z   krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0
2025-05-07T20:25:54.4384521Z   libcap conda-forge/linux-64::libcap-2.71-h39aace5_0
2025-05-07T20:25:54.4384960Z   libcublas conda-forge/linux-64::libcublas-12.8.3.14-h9ab20c4_0
2025-05-07T20:25:54.4385467Z   libcublas-dev conda-forge/linux-64::libcublas-dev-12.8.3.14-h9ab20c4_0
2025-05-07T20:25:54.4385968Z   libcufft conda-forge/linux-64::libcufft-11.3.3.41-hbd13f7d_0
2025-05-07T20:25:54.4386456Z   libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.3.41-h5888daf_0
2025-05-07T20:25:54.4386947Z   libcufile conda-forge/linux-64::libcufile-1.13.0.11-h12f29b5_0
2025-05-07T20:25:54.4387446Z   libcufile-dev conda-forge/linux-64::libcufile-dev-1.13.0.11-h5888daf_0
2025-05-07T20:25:54.4387954Z   libcurand conda-forge/linux-64::libcurand-10.3.9.55-hbd13f7d_0
2025-05-07T20:25:54.4388446Z   libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.9.55-h5888daf_0
2025-05-07T20:25:54.4388984Z   libcusolver conda-forge/linux-64::libcusolver-11.7.2.55-h9ab20c4_0
2025-05-07T20:25:54.4389658Z   libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.2.55-h9ab20c4_0
2025-05-07T20:25:54.4390198Z   libcusparse conda-forge/linux-64::libcusparse-12.5.7.53-hbd13f7d_0
2025-05-07T20:25:54.4390722Z   libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.7.53-h5888daf_0
2025-05-07T20:25:54.4391241Z   libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2
2025-05-07T20:25:54.4391697Z   libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0
2025-05-07T20:25:54.4392165Z   libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1
2025-05-07T20:25:54.4392660Z   libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1
2025-05-07T20:25:54.4393167Z   libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2
2025-05-07T20:25:54.4393644Z   libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0
2025-05-07T20:25:54.4394082Z   libglvnd conda-forge/linux-64::libglvnd-1.7.0-ha4b6fd6_2
2025-05-07T20:25:54.4394642Z   libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0
2025-05-07T20:25:54.4395110Z   libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1
2025-05-07T20:25:54.4395538Z   libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0
2025-05-07T20:25:54.4395962Z   libnpp conda-forge/linux-64::libnpp-12.3.3.65-hbd13f7d_0
2025-05-07T20:25:54.4396421Z   libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.3.65-h5888daf_0
2025-05-07T20:25:54.4396870Z   libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0
2025-05-07T20:25:54.4397296Z   libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2
2025-05-07T20:25:54.4397836Z   libnvfatbin conda-forge/linux-64::libnvfatbin-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4398360Z   libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.8.55-h5888daf_0
2025-05-07T20:25:54.4398892Z   libnvjitlink conda-forge/linux-64::libnvjitlink-12.8.61-hbd13f7d_0
2025-05-07T20:25:54.4399437Z   libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.8.61-h5888daf_0
2025-05-07T20:25:54.4399963Z   libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.5.57-h97fd463_0
2025-05-07T20:25:54.4400474Z   libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.5.57-ha770c72_0
2025-05-07T20:25:54.4400969Z   libopengl conda-forge/linux-64::libopengl-1.7.0-ha4b6fd6_2
2025-05-07T20:25:54.4401409Z   libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0
2025-05-07T20:25:54.4401840Z   libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0
2025-05-07T20:25:54.4402311Z   libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0
2025-05-07T20:25:54.4402777Z   libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0
2025-05-07T20:25:54.4403202Z   libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0
2025-05-07T20:25:54.4403654Z   libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0
2025-05-07T20:25:54.4404142Z   libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1
2025-05-07T20:25:54.4404585Z   libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0
2025-05-07T20:25:54.4405008Z   libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:54.4405414Z   lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0
2025-05-07T20:25:54.4405899Z   nsight-compute conda-forge/linux-64::nsight-compute-2025.1.0.14-hb5ebaad_0
2025-05-07T20:25:54.4406384Z   nspr conda-forge/linux-64::nspr-4.36-h5888daf_0
2025-05-07T20:25:54.4406752Z   nss conda-forge/linux-64::nss-3.111-h159eef7_0
2025-05-07T20:25:54.4407148Z   ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0
2025-05-07T20:25:54.4407640Z   opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0
2025-05-07T20:25:54.4408125Z   pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2
2025-05-07T20:25:54.4408585Z   pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002
2025-05-07T20:25:54.4409117Z   rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0
2025-05-07T20:25:54.4409550Z   wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0
2025-05-07T20:25:54.4409977Z   xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2
2025-05-07T20:25:54.4410450Z   xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0
2025-05-07T20:25:54.4410971Z   xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2
2025-05-07T20:25:54.4411500Z   xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0
2025-05-07T20:25:54.4412073Z   xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0
2025-05-07T20:25:54.4412593Z   xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
2025-05-07T20:25:54.4413102Z   xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
2025-05-07T20:25:54.4413744Z   xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
2025-05-07T20:25:54.4414213Z   xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
2025-05-07T20:25:54.4414675Z   xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
2025-05-07T20:25:54.4415150Z   xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
2025-05-07T20:25:54.4415689Z   xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
2025-05-07T20:25:54.4416262Z   xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
2025-05-07T20:25:54.4416793Z   xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
2025-05-07T20:25:54.4417376Z   xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
2025-05-07T20:25:54.4417886Z   xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
2025-05-07T20:25:54.4418372Z   xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
2025-05-07T20:25:54.4418868Z   xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
2025-05-07T20:25:54.4419453Z   xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
2025-05-07T20:25:54.4419978Z   xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
2025-05-07T20:25:54.4420415Z   zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2
2025-05-07T20:25:54.4420780Z The following packages will be UPDATED:
2025-05-07T20:25:54.4421261Z   libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:25:54.4421861Z   zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:54.4422395Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:54.4423108Z   python pkgs/main::python-3.11.11-he870216_0 --> conda-forge::python-3.11.8-hab00c5b_0_cpython
2025-05-07T20:25:54.4423730Z   sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:25:54.4434049Z   tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
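The UPDATED and SUPERSEDED sections are a direct consequence of `-c conda-forge --override-channels`: packages previously installed from pkgs/main (python, sqlite, tk) are swapped for conda-forge builds, even where that means an older version (python 3.11.11 -> 3.11.8). A sketch of how an environment can be committed to conda-forge up front so later installs never mix channels (these are standard conda configuration commands; applying them here is an assumption, not something this log shows):

  # Prefer conda-forge for all future solves and enforce strict channel priority
  conda config --add channels conda-forge
  conda config --set channel_priority strict

With strict priority the solver never falls back to a lower-priority channel for a package that exists in conda-forge, so channel-mixing SUPERSEDED moves like the ones above do not occur.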
2025-05-07T20:25:54.4434544Z Downloading and Extracting Packages: ...working...
[... several hundred interleaved progress-bar updates elided: libcublas (460.2 MB), nsight-compute (320.6 MB), libcusparse (164.9 MB), libcusolver (156.9 MB), libcufft (147.4 MB), libnpp (130.6 MB), cuda-nsight (113.2 MB), cuda-nvvp (112.4 MB), and the remaining packages download in parallel; the log breaks off with the largest archives at roughly 20% to 66% complete ...]
| 320.6 MB | ###3 | 34%  2025-05-07T20:25:57.5731766Z libcublas-12.8.3.14 | 460.2 MB | ## | 21% 2025-05-07T20:25:57.5732024Z 2025-05-07T20:25:57.5732030Z 2025-05-07T20:25:57.5732035Z 2025-05-07T20:25:57.6107388Z libcusolver-11.7.2.5 | 156.9 MB | ######8 | 68%  2025-05-07T20:25:57.6107665Z 2025-05-07T20:25:57.6108361Z 2025-05-07T20:25:57.6195384Z libcusparse-12.5.7.5 | 164.9 MB | ######5 | 65%  2025-05-07T20:25:57.6195653Z 2025-05-07T20:25:57.6195658Z 2025-05-07T20:25:57.6195661Z 2025-05-07T20:25:57.6195665Z 2025-05-07T20:25:57.6281218Z libcufft-11.3.3.41 | 147.4 MB | ######6 | 66%  2025-05-07T20:25:57.6285738Z 2025-05-07T20:25:57.6375507Z nsight-compute-2025. | 320.6 MB | ###4 | 35%  2025-05-07T20:25:57.6737149Z libcublas-12.8.3.14 | 460.2 MB | ##1 | 21% 2025-05-07T20:25:57.6737417Z 2025-05-07T20:25:57.6737421Z 2025-05-07T20:25:57.6738362Z 2025-05-07T20:25:57.7113480Z libcusolver-11.7.2.5 | 156.9 MB | ####### | 71%  2025-05-07T20:25:57.7113876Z 2025-05-07T20:25:57.7115520Z 2025-05-07T20:25:57.7281477Z libcusparse-12.5.7.5 | 164.9 MB | ######7 | 67%  2025-05-07T20:25:57.7282599Z 2025-05-07T20:25:57.7347027Z nsight-compute-2025. | 320.6 MB | ###5 | 36%  2025-05-07T20:25:57.7347304Z 2025-05-07T20:25:57.7347308Z 2025-05-07T20:25:57.7347567Z 2025-05-07T20:25:57.7350240Z 2025-05-07T20:25:57.7375865Z libcufft-11.3.3.41 | 147.4 MB | ######8 | 69%  2025-05-07T20:25:57.7767581Z libcublas-12.8.3.14 | 460.2 MB | ##2 | 22% 2025-05-07T20:25:57.7767956Z 2025-05-07T20:25:57.7767974Z 2025-05-07T20:25:57.7768916Z 2025-05-07T20:25:57.8164540Z libcusolver-11.7.2.5 | 156.9 MB | #######2 | 73%  2025-05-07T20:25:57.8164830Z 2025-05-07T20:25:57.8167320Z 2025-05-07T20:25:57.8284736Z libcusparse-12.5.7.5 | 164.9 MB | ######9 | 70%  2025-05-07T20:25:57.8286975Z 2025-05-07T20:25:57.8347482Z nsight-compute-2025. | 320.6 MB | ###6 | 37%  2025-05-07T20:25:57.8348037Z 2025-05-07T20:25:57.8348041Z 2025-05-07T20:25:57.8348045Z 2025-05-07T20:25:57.8348049Z 2025-05-07T20:25:57.8375872Z libcufft-11.3.3.41 | 147.4 MB | #######1 | 71%  2025-05-07T20:25:57.8770646Z libcublas-12.8.3.14 | 460.2 MB | ##2 | 23% 2025-05-07T20:25:57.8770899Z 2025-05-07T20:25:57.8770951Z 2025-05-07T20:25:57.8771019Z 2025-05-07T20:25:57.9177332Z libcusolver-11.7.2.5 | 156.9 MB | #######5 | 75%  2025-05-07T20:25:57.9177746Z 2025-05-07T20:25:57.9182545Z 2025-05-07T20:25:57.9286498Z libcusparse-12.5.7.5 | 164.9 MB | #######1 | 72%  2025-05-07T20:25:57.9287530Z 2025-05-07T20:25:57.9378563Z nsight-compute-2025. | 320.6 MB | ###8 | 38%  2025-05-07T20:25:57.9662158Z libcublas-12.8.3.14 | 460.2 MB | ##3 | 24% 2025-05-07T20:25:57.9662423Z 2025-05-07T20:25:57.9662429Z 2025-05-07T20:25:57.9662435Z 2025-05-07T20:25:57.9662440Z 2025-05-07T20:25:57.9771081Z libcufft-11.3.3.41 | 147.4 MB | #######3 | 74%  2025-05-07T20:25:57.9771500Z 2025-05-07T20:25:57.9771504Z 2025-05-07T20:25:57.9771508Z 2025-05-07T20:25:58.0180056Z libcusolver-11.7.2.5 | 156.9 MB | #######7 | 78%  2025-05-07T20:25:58.0180353Z 2025-05-07T20:25:58.0182144Z 2025-05-07T20:25:58.0288508Z libcusparse-12.5.7.5 | 164.9 MB | #######4 | 74%  2025-05-07T20:25:58.0290703Z 2025-05-07T20:25:58.0385113Z nsight-compute-2025. 
| 320.6 MB | ###9 | 39%  2025-05-07T20:25:58.0771673Z libcublas-12.8.3.14 | 460.2 MB | ##4 | 24% 2025-05-07T20:25:58.0772046Z 2025-05-07T20:25:58.0772052Z 2025-05-07T20:25:58.0772057Z 2025-05-07T20:25:58.1185589Z libcusolver-11.7.2.5 | 156.9 MB | ######## | 81%  2025-05-07T20:25:58.1186014Z 2025-05-07T20:25:58.1187845Z 2025-05-07T20:25:58.1290788Z libcusparse-12.5.7.5 | 164.9 MB | #######6 | 77%  2025-05-07T20:25:58.1292411Z 2025-05-07T20:25:58.1298526Z nsight-compute-2025. | 320.6 MB | #### | 41%  2025-05-07T20:25:58.1298823Z 2025-05-07T20:25:58.1298827Z 2025-05-07T20:25:58.1298831Z 2025-05-07T20:25:58.1298835Z 2025-05-07T20:25:58.1385423Z libcufft-11.3.3.41 | 147.4 MB | #######5 | 76%  2025-05-07T20:25:58.1838155Z libcublas-12.8.3.14 | 460.2 MB | ##5 | 25% 2025-05-07T20:25:58.1838477Z 2025-05-07T20:25:58.1838481Z 2025-05-07T20:25:58.1838507Z 2025-05-07T20:25:58.2270271Z libcusolver-11.7.2.5 | 156.9 MB | ########3 | 83%  2025-05-07T20:25:58.2270583Z 2025-05-07T20:25:58.2270595Z 2025-05-07T20:25:58.2300413Z libcusparse-12.5.7.5 | 164.9 MB | #######8 | 79%  2025-05-07T20:25:58.2300715Z 2025-05-07T20:25:58.2300719Z 2025-05-07T20:25:58.2300722Z 2025-05-07T20:25:58.2301899Z 2025-05-07T20:25:58.2306724Z libcufft-11.3.3.41 | 147.4 MB | #######8 | 78%  2025-05-07T20:25:58.2308250Z 2025-05-07T20:25:58.2475470Z nsight-compute-2025. | 320.6 MB | ####1 | 42%  2025-05-07T20:25:58.2879053Z libcublas-12.8.3.14 | 460.2 MB | ##6 | 26% 2025-05-07T20:25:58.2879364Z 2025-05-07T20:25:58.2879370Z 2025-05-07T20:25:58.2881991Z 2025-05-07T20:25:58.3301955Z libcusolver-11.7.2.5 | 156.9 MB | ########5 | 86%  2025-05-07T20:25:58.3302333Z 2025-05-07T20:25:58.3302339Z 2025-05-07T20:25:58.3302344Z 2025-05-07T20:25:58.3304169Z 2025-05-07T20:25:58.3370356Z libcufft-11.3.3.41 | 147.4 MB | ######## | 81%  2025-05-07T20:25:58.3370643Z 2025-05-07T20:25:58.3370648Z 2025-05-07T20:25:58.3451552Z libcusparse-12.5.7.5 | 164.9 MB | ########1 | 81%  2025-05-07T20:25:58.3455615Z 2025-05-07T20:25:58.3485913Z nsight-compute-2025. | 320.6 MB | ####2 | 43%  2025-05-07T20:25:58.4303743Z libcublas-12.8.3.14 | 460.2 MB | ##6 | 27% 2025-05-07T20:25:58.4304015Z 2025-05-07T20:25:58.4304019Z 2025-05-07T20:25:58.4304023Z 2025-05-07T20:25:58.4307223Z 2025-05-07T20:25:58.4372393Z libcufft-11.3.3.41 | 147.4 MB | ########3 | 83%  2025-05-07T20:25:58.4372765Z 2025-05-07T20:25:58.4373047Z 2025-05-07T20:25:58.4452967Z libcusparse-12.5.7.5 | 164.9 MB | ########3 | 83%  2025-05-07T20:25:58.4453263Z 2025-05-07T20:25:58.4491472Z nsight-compute-2025. | 320.6 MB | ####4 | 44%  2025-05-07T20:25:58.4879256Z libcublas-12.8.3.14 | 460.2 MB | ##7 | 28% 2025-05-07T20:25:58.4879580Z 2025-05-07T20:25:58.4879584Z 2025-05-07T20:25:58.4879610Z 2025-05-07T20:25:58.5304486Z libcusolver-11.7.2.5 | 156.9 MB | ########8 | 88%  2025-05-07T20:25:58.5304851Z 2025-05-07T20:25:58.5304857Z 2025-05-07T20:25:58.5304863Z 2025-05-07T20:25:58.5307465Z 2025-05-07T20:25:58.5372718Z libcufft-11.3.3.41 | 147.4 MB | ########6 | 86%  2025-05-07T20:25:58.5373110Z 2025-05-07T20:25:58.5373711Z 2025-05-07T20:25:58.5494898Z libcusparse-12.5.7.5 | 164.9 MB | ########5 | 86%  2025-05-07T20:25:58.5510545Z libcublas-12.8.3.14 | 460.2 MB | ##8 | 29% 2025-05-07T20:25:58.5513491Z 2025-05-07T20:25:58.5880081Z nsight-compute-2025. 
| 320.6 MB | ####5 | 45%  2025-05-07T20:25:58.5880422Z 2025-05-07T20:25:58.5880426Z 2025-05-07T20:25:58.5882441Z 2025-05-07T20:25:58.6335417Z libcusolver-11.7.2.5 | 156.9 MB | ######### | 90%  2025-05-07T20:25:58.6335698Z 2025-05-07T20:25:58.6335702Z 2025-05-07T20:25:58.6335706Z 2025-05-07T20:25:58.6335714Z 2025-05-07T20:25:58.6375258Z libcufft-11.3.3.41 | 147.4 MB | ########8 | 89%  2025-05-07T20:25:58.6375826Z 2025-05-07T20:25:58.6379401Z 2025-05-07T20:25:58.6514962Z libcusparse-12.5.7.5 | 164.9 MB | ########8 | 88%  2025-05-07T20:25:58.6517268Z 2025-05-07T20:25:58.6551472Z nsight-compute-2025. | 320.6 MB | ####6 | 46%  2025-05-07T20:25:58.6880711Z libcublas-12.8.3.14 | 460.2 MB | ##9 | 29% 2025-05-07T20:25:58.6880970Z 2025-05-07T20:25:58.6880975Z 2025-05-07T20:25:58.6882603Z 2025-05-07T20:25:58.7341942Z libcusolver-11.7.2.5 | 156.9 MB | #########2 | 93%  2025-05-07T20:25:58.7342235Z 2025-05-07T20:25:58.7342239Z 2025-05-07T20:25:58.7342269Z 2025-05-07T20:25:58.7343334Z 2025-05-07T20:25:58.7453592Z libcufft-11.3.3.41 | 147.4 MB | #########1 | 91%  2025-05-07T20:25:58.7453870Z 2025-05-07T20:25:58.7453874Z 2025-05-07T20:25:58.7547629Z libcusparse-12.5.7.5 | 164.9 MB | ######### | 90%  2025-05-07T20:25:58.7550507Z 2025-05-07T20:25:58.7556904Z nsight-compute-2025. | 320.6 MB | ####7 | 48%  2025-05-07T20:25:58.7881401Z libcublas-12.8.3.14 | 460.2 MB | ### | 30% 2025-05-07T20:25:58.7881781Z 2025-05-07T20:25:58.7881788Z 2025-05-07T20:25:58.7883123Z 2025-05-07T20:25:58.8368673Z libcusolver-11.7.2.5 | 156.9 MB | #########5 | 95%  2025-05-07T20:25:58.8368969Z 2025-05-07T20:25:58.8368974Z 2025-05-07T20:25:58.8368978Z 2025-05-07T20:25:58.8368981Z 2025-05-07T20:25:58.8455842Z libcufft-11.3.3.41 | 147.4 MB | #########3 | 93%  2025-05-07T20:25:58.8456226Z 2025-05-07T20:25:58.8457832Z 2025-05-07T20:25:58.8595179Z libcusparse-12.5.7.5 | 164.9 MB | #########2 | 93%  2025-05-07T20:25:58.8635740Z libcublas-12.8.3.14 | 460.2 MB | ###1 | 31% 2025-05-07T20:25:58.8636822Z 2025-05-07T20:25:58.8883933Z nsight-compute-2025. | 320.6 MB | ####8 | 49%  2025-05-07T20:25:58.8884206Z 2025-05-07T20:25:58.8884211Z 2025-05-07T20:25:58.8886266Z 2025-05-07T20:25:58.9371310Z libcusolver-11.7.2.5 | 156.9 MB | #########7 | 98%  2025-05-07T20:25:58.9371622Z 2025-05-07T20:25:58.9371627Z 2025-05-07T20:25:58.9371631Z 2025-05-07T20:25:58.9371646Z 2025-05-07T20:25:58.9490340Z libcufft-11.3.3.41 | 147.4 MB | #########5 | 96%  2025-05-07T20:25:58.9490654Z 2025-05-07T20:25:58.9494051Z 2025-05-07T20:25:58.9662353Z libcusparse-12.5.7.5 | 164.9 MB | #########4 | 95%  2025-05-07T20:25:58.9665103Z 2025-05-07T20:25:58.9742804Z nsight-compute-2025. | 320.6 MB | ####9 | 50%  2025-05-07T20:25:59.0374152Z libcublas-12.8.3.14 | 460.2 MB | ###1 | 32% 2025-05-07T20:25:59.0374474Z 2025-05-07T20:25:59.0374771Z 2025-05-07T20:25:59.0374775Z 2025-05-07T20:25:59.0374778Z 2025-05-07T20:25:59.0516062Z libcufft-11.3.3.41 | 147.4 MB | #########8 | 98%  2025-05-07T20:25:59.0516338Z 2025-05-07T20:25:59.0516713Z 2025-05-07T20:25:59.0665634Z libcusparse-12.5.7.5 | 164.9 MB | #########7 | 97%  2025-05-07T20:25:59.0665912Z 2025-05-07T20:25:59.0744710Z nsight-compute-2025. | 320.6 MB | #####1 | 51%  2025-05-07T20:25:59.1517601Z libcublas-12.8.3.14 | 460.2 MB | ###2 | 33% 2025-05-07T20:25:59.1517959Z 2025-05-07T20:25:59.1520200Z 2025-05-07T20:25:59.1666168Z libcusparse-12.5.7.5 | 164.9 MB | #########9 | 100%  2025-05-07T20:25:59.1666548Z 2025-05-07T20:25:59.1747096Z nsight-compute-2025. 
| 320.6 MB | #####2 | 52%  2025-05-07T20:25:59.2667615Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 34% 2025-05-07T20:25:59.2667939Z 2025-05-07T20:25:59.2753896Z nsight-compute-2025. | 320.6 MB | #####4 | 54%  2025-05-07T20:25:59.3671457Z libcublas-12.8.3.14 | 460.2 MB | ###4 | 35% 2025-05-07T20:25:59.3671733Z 2025-05-07T20:25:59.3760181Z nsight-compute-2025. | 320.6 MB | #####5 | 56%  2025-05-07T20:25:59.4764895Z libcublas-12.8.3.14 | 460.2 MB | ###6 | 36% 2025-05-07T20:25:59.4790246Z libcublas-12.8.3.14 | 460.2 MB | ###7 | 37% 2025-05-07T20:25:59.4791979Z 2025-05-07T20:25:59.5766316Z nsight-compute-2025. | 320.6 MB | #####7 | 57%  2025-05-07T20:25:59.5790766Z libcublas-12.8.3.14 | 460.2 MB | ###8 | 39% 2025-05-07T20:25:59.5791692Z 2025-05-07T20:25:59.6767766Z nsight-compute-2025. | 320.6 MB | #####9 | 59%  2025-05-07T20:25:59.6791188Z libcublas-12.8.3.14 | 460.2 MB | ###9 | 40% 2025-05-07T20:25:59.6792076Z 2025-05-07T20:25:59.7768360Z nsight-compute-2025. | 320.6 MB | ######1 | 61%  2025-05-07T20:25:59.7791707Z libcublas-12.8.3.14 | 460.2 MB | ####1 | 41% 2025-05-07T20:25:59.7792035Z 2025-05-07T20:25:59.8942725Z nsight-compute-2025. | 320.6 MB | ######2 | 63%  2025-05-07T20:25:59.8944263Z 2025-05-07T20:25:59.9139040Z nsight-compute-2025. | 320.6 MB | ######4 | 64%  2025-05-07T20:26:00.0115252Z libcublas-12.8.3.14 | 460.2 MB | ####2 | 42% 2025-05-07T20:26:00.0116854Z 2025-05-07T20:26:00.0277991Z nsight-compute-2025. | 320.6 MB | ######5 | 66%  2025-05-07T20:26:00.1274686Z libcublas-12.8.3.14 | 460.2 MB | ####3 | 43% 2025-05-07T20:26:00.1274939Z 2025-05-07T20:26:00.1323671Z nsight-compute-2025. | 320.6 MB | ######7 | 67%  2025-05-07T20:26:00.2275701Z libcublas-12.8.3.14 | 460.2 MB | ####4 | 44% 2025-05-07T20:26:00.2277814Z 2025-05-07T20:26:00.2329008Z nsight-compute-2025. | 320.6 MB | ######8 | 69%  2025-05-07T20:26:00.3282015Z libcublas-12.8.3.14 | 460.2 MB | ####5 | 46% 2025-05-07T20:26:00.3282371Z 2025-05-07T20:26:00.3329874Z nsight-compute-2025. | 320.6 MB | ####### | 71%  2025-05-07T20:26:00.4332110Z libcublas-12.8.3.14 | 460.2 MB | ####6 | 47% 2025-05-07T20:26:00.4837719Z libcublas-12.8.3.14 | 460.2 MB | ####8 | 48% 2025-05-07T20:26:00.4838039Z 2025-05-07T20:26:00.5336324Z nsight-compute-2025. | 320.6 MB | #######2 | 72%  2025-05-07T20:26:00.5839557Z libcublas-12.8.3.14 | 460.2 MB | ####9 | 49% 2025-05-07T20:26:00.5839812Z 2025-05-07T20:26:00.6419485Z nsight-compute-2025. | 320.6 MB | #######4 | 74%  2025-05-07T20:26:00.6841555Z libcublas-12.8.3.14 | 460.2 MB | ##### | 51% 2025-05-07T20:26:00.6841831Z 2025-05-07T20:26:00.7421924Z nsight-compute-2025. | 320.6 MB | #######5 | 76%  2025-05-07T20:26:00.7842020Z libcublas-12.8.3.14 | 460.2 MB | #####1 | 52% 2025-05-07T20:26:00.7842370Z 2025-05-07T20:26:00.8845114Z nsight-compute-2025. | 320.6 MB | #######7 | 78%  2025-05-07T20:26:00.8845838Z 2025-05-07T20:26:00.9285999Z nsight-compute-2025. | 320.6 MB | ######## | 81%  2025-05-07T20:26:01.0036364Z libcublas-12.8.3.14 | 460.2 MB | #####2 | 53% 2025-05-07T20:26:01.0037373Z 2025-05-07T20:26:01.0288411Z nsight-compute-2025. | 320.6 MB | ########2 | 83%  2025-05-07T20:26:01.1143044Z libcublas-12.8.3.14 | 460.2 MB | #####4 | 54% 2025-05-07T20:26:01.1143381Z 2025-05-07T20:26:01.1289365Z nsight-compute-2025. | 320.6 MB | ########4 | 85%  2025-05-07T20:26:01.2207312Z libcublas-12.8.3.14 | 460.2 MB | #####5 | 55% 2025-05-07T20:26:01.2209142Z 2025-05-07T20:26:01.2290014Z nsight-compute-2025. 
| 320.6 MB | ########6 | 86%  2025-05-07T20:26:01.3353553Z libcublas-12.8.3.14 | 460.2 MB | #####6 | 57% 2025-05-07T20:26:01.3389017Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 58% 2025-05-07T20:26:01.3389415Z 2025-05-07T20:26:01.3927530Z nsight-compute-2025. | 320.6 MB | ########8 | 88%  2025-05-07T20:26:01.3927799Z 2025-05-07T20:26:01.3927804Z 2025-05-07T20:26:01.3927807Z 2025-05-07T20:26:01.3932787Z 2025-05-07T20:26:01.4388219Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:26:01.4466162Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 59% 2025-05-07T20:26:01.4466560Z 2025-05-07T20:26:01.4466809Z 2025-05-07T20:26:01.4466816Z 2025-05-07T20:26:01.4466822Z 2025-05-07T20:26:01.4466849Z 2025-05-07T20:26:01.4635396Z libnpp-12.3.3.65 | 130.6 MB | | 0%  2025-05-07T20:26:01.4636435Z 2025-05-07T20:26:01.5468800Z nsight-compute-2025. | 320.6 MB | ######### | 90%  2025-05-07T20:26:01.5469225Z 2025-05-07T20:26:01.5469231Z 2025-05-07T20:26:01.5469234Z 2025-05-07T20:26:01.5469238Z 2025-05-07T20:26:01.5472587Z 2025-05-07T20:26:01.5757968Z libnpp-12.3.3.65 | 130.6 MB | 2 | 2%  2025-05-07T20:26:01.6111786Z libcublas-12.8.3.14 | 460.2 MB | #####9 | 60% 2025-05-07T20:26:01.6112075Z 2025-05-07T20:26:01.6469299Z nsight-compute-2025. | 320.6 MB | #########1 | 92%  2025-05-07T20:26:01.6469581Z 2025-05-07T20:26:01.6469585Z 2025-05-07T20:26:01.6469589Z 2025-05-07T20:26:01.6469592Z 2025-05-07T20:26:01.6469596Z 2025-05-07T20:26:01.7090780Z libnpp-12.3.3.65 | 130.6 MB | 4 | 5%  2025-05-07T20:26:01.7154642Z libcublas-12.8.3.14 | 460.2 MB | ###### | 61% 2025-05-07T20:26:01.7154943Z 2025-05-07T20:26:01.7154947Z 2025-05-07T20:26:01.7154951Z 2025-05-07T20:26:01.7474750Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:26:01.7475232Z 2025-05-07T20:26:01.7475238Z 2025-05-07T20:26:01.7475275Z 2025-05-07T20:26:01.7475281Z 2025-05-07T20:26:01.7475287Z 2025-05-07T20:26:01.7614154Z libnpp-12.3.3.65 | 130.6 MB | 7 | 7%  2025-05-07T20:26:01.7614978Z 2025-05-07T20:26:01.7720064Z nsight-compute-2025. | 320.6 MB | #########3 | 93%  2025-05-07T20:26:01.7720334Z 2025-05-07T20:26:01.7720338Z 2025-05-07T20:26:01.7720342Z 2025-05-07T20:26:01.7720346Z 2025-05-07T20:26:01.7720350Z 2025-05-07T20:26:01.7721137Z 2025-05-07T20:26:01.8477371Z cuda-nsight-12.8.55 | 113.2 MB | | 0%  2025-05-07T20:26:01.8477683Z 2025-05-07T20:26:01.8477689Z 2025-05-07T20:26:01.8477727Z 2025-05-07T20:26:01.8477732Z 2025-05-07T20:26:01.8483459Z 2025-05-07T20:26:01.8491704Z libnpp-12.3.3.65 | 130.6 MB | 9 | 10%  2025-05-07T20:26:01.8722424Z libcublas-12.8.3.14 | 460.2 MB | ######1 | 62% 2025-05-07T20:26:01.8722707Z 2025-05-07T20:26:01.8722713Z 2025-05-07T20:26:01.8722718Z 2025-05-07T20:26:01.8722723Z 2025-05-07T20:26:01.8723012Z 2025-05-07T20:26:01.8725471Z 2025-05-07T20:26:01.9086030Z cuda-nsight-12.8.55 | 113.2 MB | 2 | 3%  2025-05-07T20:26:01.9086904Z 2025-05-07T20:26:01.9486733Z nsight-compute-2025. 
| 320.6 MB | #########4 | 95%  2025-05-07T20:26:01.9487091Z 2025-05-07T20:26:01.9487097Z 2025-05-07T20:26:01.9487102Z 2025-05-07T20:26:01.9487107Z 2025-05-07T20:26:01.9489695Z 2025-05-07T20:26:01.9723876Z libnpp-12.3.3.65 | 130.6 MB | #2 | 12%  2025-05-07T20:26:01.9724272Z 2025-05-07T20:26:01.9724277Z 2025-05-07T20:26:01.9724280Z 2025-05-07T20:26:01.9724535Z 2025-05-07T20:26:01.9724539Z 2025-05-07T20:26:01.9726778Z 2025-05-07T20:26:01.9799201Z cuda-nsight-12.8.55 | 113.2 MB | 5 | 5%  2025-05-07T20:26:01.9799596Z 2025-05-07T20:26:01.9801474Z 2025-05-07T20:26:01.9928798Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:26:02.0462001Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 63% 2025-05-07T20:26:02.0462365Z 2025-05-07T20:26:02.0488240Z nsight-compute-2025. | 320.6 MB | #########5 | 96%  2025-05-07T20:26:02.0488612Z 2025-05-07T20:26:02.0488618Z 2025-05-07T20:26:02.0488623Z 2025-05-07T20:26:02.0488628Z 2025-05-07T20:26:02.0488633Z 2025-05-07T20:26:02.0488638Z 2025-05-07T20:26:02.0490944Z 2025-05-07T20:26:02.0498569Z cuda-nvvp-12.8.57 | 112.4 MB | | 0%  2025-05-07T20:26:02.0498863Z 2025-05-07T20:26:02.0498867Z 2025-05-07T20:26:02.0498871Z 2025-05-07T20:26:02.0498874Z 2025-05-07T20:26:02.0501004Z 2025-05-07T20:26:02.0724115Z libnpp-12.3.3.65 | 130.6 MB | #4 | 14%  2025-05-07T20:26:02.0724431Z 2025-05-07T20:26:02.0724435Z 2025-05-07T20:26:02.0724439Z 2025-05-07T20:26:02.0724442Z 2025-05-07T20:26:02.0724446Z 2025-05-07T20:26:02.0727959Z 2025-05-07T20:26:02.1244691Z cuda-nsight-12.8.55 | 113.2 MB | 8 | 8%  2025-05-07T20:26:02.1490218Z libcublas-12.8.3.14 | 460.2 MB | ######3 | 64% 2025-05-07T20:26:02.1490496Z 2025-05-07T20:26:02.1490501Z 2025-05-07T20:26:02.1490505Z 2025-05-07T20:26:02.1490508Z 2025-05-07T20:26:02.1490513Z 2025-05-07T20:26:02.1490516Z 2025-05-07T20:26:02.1492500Z 2025-05-07T20:26:02.1761692Z cuda-nvvp-12.8.57 | 112.4 MB | 1 | 1%  2025-05-07T20:26:02.1762009Z 2025-05-07T20:26:02.1762015Z 2025-05-07T20:26:02.1762018Z 2025-05-07T20:26:02.1762022Z 2025-05-07T20:26:02.1762026Z 2025-05-07T20:26:02.1762031Z 2025-05-07T20:26:02.1781442Z cuda-nsight-12.8.55 | 113.2 MB | # | 10%  2025-05-07T20:26:02.1781862Z 2025-05-07T20:26:02.1781866Z 2025-05-07T20:26:02.1781869Z 2025-05-07T20:26:02.1781873Z 2025-05-07T20:26:02.1783936Z 2025-05-07T20:26:02.1832131Z libnpp-12.3.3.65 | 130.6 MB | #6 | 17%  2025-05-07T20:26:02.1832422Z 2025-05-07T20:26:02.2701022Z nsight-compute-2025. | 320.6 MB | #########6 | 97%  2025-05-07T20:26:02.2762283Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 64% 2025-05-07T20:26:02.2762582Z 2025-05-07T20:26:02.2762754Z 2025-05-07T20:26:02.2762759Z 2025-05-07T20:26:02.2762765Z 2025-05-07T20:26:02.2762770Z 2025-05-07T20:26:02.2762789Z 2025-05-07T20:26:02.2807229Z cuda-nsight-12.8.55 | 113.2 MB | #2 | 13%  2025-05-07T20:26:02.2807631Z 2025-05-07T20:26:02.2807638Z 2025-05-07T20:26:02.2807644Z 2025-05-07T20:26:02.2807649Z 2025-05-07T20:26:02.2807654Z 2025-05-07T20:26:02.2807660Z 2025-05-07T20:26:02.2808111Z 2025-05-07T20:26:02.2878657Z cuda-nvvp-12.8.57 | 112.4 MB | 2 | 3%  2025-05-07T20:26:02.2879101Z 2025-05-07T20:26:02.2879107Z 2025-05-07T20:26:02.2879112Z 2025-05-07T20:26:02.2879117Z 2025-05-07T20:26:02.2886473Z 2025-05-07T20:26:02.3105224Z libnpp-12.3.3.65 | 130.6 MB | #8 | 19%  2025-05-07T20:26:02.3107274Z 2025-05-07T20:26:02.3810075Z nsight-compute-2025. 
| 320.6 MB | #########8 | 98%  2025-05-07T20:26:02.3810809Z 2025-05-07T20:26:02.3810818Z 2025-05-07T20:26:02.3810823Z 2025-05-07T20:26:02.3810829Z 2025-05-07T20:26:02.3810834Z 2025-05-07T20:26:02.3810839Z 2025-05-07T20:26:02.3811098Z 2025-05-07T20:26:02.3821088Z cuda-nvvp-12.8.57 | 112.4 MB | 5 | 5%  2025-05-07T20:26:02.3821515Z 2025-05-07T20:26:02.3821521Z 2025-05-07T20:26:02.3821527Z 2025-05-07T20:26:02.3821532Z 2025-05-07T20:26:02.3821538Z 2025-05-07T20:26:02.3824646Z 2025-05-07T20:26:02.3925938Z cuda-nsight-12.8.55 | 113.2 MB | #4 | 15%  2025-05-07T20:26:02.3934745Z 2025-05-07T20:26:02.3934753Z 2025-05-07T20:26:02.3935036Z 2025-05-07T20:26:02.3935042Z 2025-05-07T20:26:02.3935047Z 2025-05-07T20:26:02.4184246Z libnpp-12.3.3.65 | 130.6 MB | ## | 21%  2025-05-07T20:26:02.4354239Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 65% 2025-05-07T20:26:02.4354573Z 2025-05-07T20:26:02.4813955Z nsight-compute-2025. | 320.6 MB | #########8 | 99%  2025-05-07T20:26:02.4814403Z 2025-05-07T20:26:02.4814410Z 2025-05-07T20:26:02.4814415Z 2025-05-07T20:26:02.4814420Z 2025-05-07T20:26:02.4814426Z 2025-05-07T20:26:02.4814431Z 2025-05-07T20:26:02.4815640Z 2025-05-07T20:26:02.4927534Z cuda-nvvp-12.8.57 | 112.4 MB | 7 | 8%  2025-05-07T20:26:02.4927830Z 2025-05-07T20:26:02.4927834Z 2025-05-07T20:26:02.4927837Z 2025-05-07T20:26:02.4927841Z 2025-05-07T20:26:02.4930796Z 2025-05-07T20:26:02.4935509Z libnpp-12.3.3.65 | 130.6 MB | ##3 | 23%  2025-05-07T20:26:02.4935843Z 2025-05-07T20:26:02.4935847Z 2025-05-07T20:26:02.4935875Z 2025-05-07T20:26:02.4935878Z 2025-05-07T20:26:02.4935882Z 2025-05-07T20:26:02.4935885Z 2025-05-07T20:26:02.5227029Z cuda-nsight-12.8.55 | 113.2 MB | #7 | 17%  2025-05-07T20:26:02.5578212Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 66% 2025-05-07T20:26:02.5581684Z 2025-05-07T20:26:02.5814129Z nsight-compute-2025. 
| 320.6 MB | #########9 | 100%  2025-05-07T20:26:02.5814428Z 2025-05-07T20:26:02.5814432Z 2025-05-07T20:26:02.5814436Z 2025-05-07T20:26:02.5814440Z 2025-05-07T20:26:02.5814443Z 2025-05-07T20:26:02.5814447Z 2025-05-07T20:26:02.5815807Z 2025-05-07T20:26:02.5938152Z cuda-nvvp-12.8.57 | 112.4 MB | 9 | 10%  2025-05-07T20:26:02.5938470Z 2025-05-07T20:26:02.5938474Z 2025-05-07T20:26:02.5938478Z 2025-05-07T20:26:02.5938481Z 2025-05-07T20:26:02.5938485Z 2025-05-07T20:26:02.5938489Z 2025-05-07T20:26:02.5958661Z cuda-nsight-12.8.55 | 113.2 MB | #9 | 20%  2025-05-07T20:26:02.5959067Z 2025-05-07T20:26:02.5959071Z 2025-05-07T20:26:02.5959075Z 2025-05-07T20:26:02.5959079Z 2025-05-07T20:26:02.5961412Z 2025-05-07T20:26:02.6284059Z libnpp-12.3.3.65 | 130.6 MB | ##5 | 25%  2025-05-07T20:26:02.6815365Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 66% 2025-05-07T20:26:02.6815753Z 2025-05-07T20:26:02.6815758Z 2025-05-07T20:26:02.6815795Z 2025-05-07T20:26:02.6815800Z 2025-05-07T20:26:02.6815803Z 2025-05-07T20:26:02.6815807Z 2025-05-07T20:26:02.6817053Z 2025-05-07T20:26:02.6940247Z cuda-nvvp-12.8.57 | 112.4 MB | #1 | 12%  2025-05-07T20:26:02.6940679Z 2025-05-07T20:26:02.6940686Z 2025-05-07T20:26:02.6940691Z 2025-05-07T20:26:02.6940696Z 2025-05-07T20:26:02.6940701Z 2025-05-07T20:26:02.6943511Z 2025-05-07T20:26:02.6960507Z cuda-nsight-12.8.55 | 113.2 MB | ##2 | 22%  2025-05-07T20:26:02.6960854Z 2025-05-07T20:26:02.6960858Z 2025-05-07T20:26:02.6960862Z 2025-05-07T20:26:02.6960866Z 2025-05-07T20:26:02.6964109Z 2025-05-07T20:26:02.7287278Z libnpp-12.3.3.65 | 130.6 MB | ##7 | 27%  2025-05-07T20:26:02.7825717Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 67% 2025-05-07T20:26:02.7826113Z 2025-05-07T20:26:02.7826121Z 2025-05-07T20:26:02.7826126Z 2025-05-07T20:26:02.7826131Z 2025-05-07T20:26:02.7826136Z 2025-05-07T20:26:02.7826141Z 2025-05-07T20:26:02.7832602Z 2025-05-07T20:26:02.7965786Z cuda-nvvp-12.8.57 | 112.4 MB | #3 | 14%  2025-05-07T20:26:02.7966089Z 2025-05-07T20:26:02.7966094Z 2025-05-07T20:26:02.7966097Z 2025-05-07T20:26:02.7966101Z 2025-05-07T20:26:02.7966376Z 2025-05-07T20:26:02.8025785Z libnpp-12.3.3.65 | 130.6 MB | ##9 | 30%  2025-05-07T20:26:02.8026080Z 2025-05-07T20:26:02.8026086Z 2025-05-07T20:26:02.8026091Z 2025-05-07T20:26:02.8026096Z 2025-05-07T20:26:02.8026102Z 2025-05-07T20:26:02.8028028Z 2025-05-07T20:26:02.8407444Z cuda-nsight-12.8.55 | 113.2 MB | ##4 | 24%  2025-05-07T20:26:02.8829140Z libcublas-12.8.3.14 | 460.2 MB | ######7 | 67% 2025-05-07T20:26:02.8829416Z 2025-05-07T20:26:02.8829420Z 2025-05-07T20:26:02.8829424Z 2025-05-07T20:26:02.8829427Z 2025-05-07T20:26:02.8829432Z 2025-05-07T20:26:02.8829436Z 2025-05-07T20:26:02.8829443Z 2025-05-07T20:26:02.9025419Z cuda-nvvp-12.8.57 | 112.4 MB | #6 | 16%  2025-05-07T20:26:02.9025778Z 2025-05-07T20:26:02.9025782Z 2025-05-07T20:26:02.9025786Z 2025-05-07T20:26:02.9025789Z 2025-05-07T20:26:02.9025793Z 2025-05-07T20:26:02.9037223Z libnpp-12.3.3.65 | 130.6 MB | ###2 | 32%  2025-05-07T20:26:02.9037502Z 2025-05-07T20:26:02.9037506Z 2025-05-07T20:26:02.9037510Z 2025-05-07T20:26:02.9037514Z 2025-05-07T20:26:02.9037518Z 2025-05-07T20:26:02.9037521Z 2025-05-07T20:26:02.9535822Z cuda-nsight-12.8.55 | 113.2 MB | ##6 | 27%  2025-05-07T20:26:02.9830899Z libcublas-12.8.3.14 | 460.2 MB | ######7 | 68% 2025-05-07T20:26:02.9831303Z 2025-05-07T20:26:02.9831309Z 2025-05-07T20:26:02.9831314Z 2025-05-07T20:26:02.9831320Z 2025-05-07T20:26:02.9831325Z 2025-05-07T20:26:02.9831330Z 2025-05-07T20:26:02.9831335Z 2025-05-07T20:26:03.0058202Z cuda-nvvp-12.8.57 | 112.4 MB | #8 | 18%  
2025-05-07T20:26:03.0058514Z 2025-05-07T20:26:03.0058519Z 2025-05-07T20:26:03.0058545Z 2025-05-07T20:26:03.0058549Z 2025-05-07T20:26:03.0058552Z 2025-05-07T20:26:03.0062319Z 2025-05-07T20:26:03.0122607Z cuda-nsight-12.8.55 | 113.2 MB | ##8 | 29%  2025-05-07T20:26:03.0122974Z 2025-05-07T20:26:03.0122978Z 2025-05-07T20:26:03.0122982Z 2025-05-07T20:26:03.0122986Z 2025-05-07T20:26:03.0122990Z 2025-05-07T20:26:03.0546216Z libnpp-12.3.3.65 | 130.6 MB | ###4 | 34%  2025-05-07T20:26:03.0883834Z libcublas-12.8.3.14 | 460.2 MB | ######8 | 68% 2025-05-07T20:26:03.0884208Z 2025-05-07T20:26:03.0884212Z 2025-05-07T20:26:03.0884247Z 2025-05-07T20:26:03.0884251Z 2025-05-07T20:26:03.0884254Z 2025-05-07T20:26:03.0884258Z 2025-05-07T20:26:03.0885742Z 2025-05-07T20:26:03.1113330Z cuda-nvvp-12.8.57 | 112.4 MB | ## | 20%  2025-05-07T20:26:03.1113688Z 2025-05-07T20:26:03.1113692Z 2025-05-07T20:26:03.1113696Z 2025-05-07T20:26:03.1113699Z 2025-05-07T20:26:03.1113703Z 2025-05-07T20:26:03.1115865Z 2025-05-07T20:26:03.1231051Z cuda-nsight-12.8.55 | 113.2 MB | ###1 | 31%  2025-05-07T20:26:03.1231475Z 2025-05-07T20:26:03.1231480Z 2025-05-07T20:26:03.1231485Z 2025-05-07T20:26:03.1231500Z 2025-05-07T20:26:03.1231505Z 2025-05-07T20:26:03.1582363Z libnpp-12.3.3.65 | 130.6 MB | ###6 | 36%  2025-05-07T20:26:03.1896851Z libcublas-12.8.3.14 | 460.2 MB | ######8 | 69% 2025-05-07T20:26:03.1897226Z 2025-05-07T20:26:03.1897232Z 2025-05-07T20:26:03.1897236Z 2025-05-07T20:26:03.1897242Z 2025-05-07T20:26:03.1897247Z 2025-05-07T20:26:03.1897252Z 2025-05-07T20:26:03.1899063Z 2025-05-07T20:26:03.2236703Z cuda-nvvp-12.8.57 | 112.4 MB | ##2 | 22%  2025-05-07T20:26:03.2237121Z 2025-05-07T20:26:03.2237127Z 2025-05-07T20:26:03.2237132Z 2025-05-07T20:26:03.2237137Z 2025-05-07T20:26:03.2237142Z 2025-05-07T20:26:03.2237147Z 2025-05-07T20:26:03.2414066Z cuda-nsight-12.8.55 | 113.2 MB | ###3 | 33%  2025-05-07T20:26:03.2414487Z 2025-05-07T20:26:03.2414492Z 2025-05-07T20:26:03.2414497Z 2025-05-07T20:26:03.2414502Z 2025-05-07T20:26:03.2417075Z 2025-05-07T20:26:03.2599610Z libnpp-12.3.3.65 | 130.6 MB | ###8 | 38%  2025-05-07T20:26:03.2914318Z libcublas-12.8.3.14 | 460.2 MB | ######9 | 69% 2025-05-07T20:26:03.2914683Z 2025-05-07T20:26:03.2914689Z 2025-05-07T20:26:03.2914694Z 2025-05-07T20:26:03.2914699Z 2025-05-07T20:26:03.2914715Z 2025-05-07T20:26:03.2914720Z 2025-05-07T20:26:03.2917542Z 2025-05-07T20:26:03.3239350Z cuda-nvvp-12.8.57 | 112.4 MB | ##4 | 25%  2025-05-07T20:26:03.3240059Z 2025-05-07T20:26:03.3240072Z 2025-05-07T20:26:03.3240076Z 2025-05-07T20:26:03.3240079Z 2025-05-07T20:26:03.3240083Z 2025-05-07T20:26:03.3240086Z 2025-05-07T20:26:03.3523528Z cuda-nsight-12.8.55 | 113.2 MB | ###5 | 36%  2025-05-07T20:26:03.3523947Z 2025-05-07T20:26:03.3523953Z 2025-05-07T20:26:03.3523983Z 2025-05-07T20:26:03.3523989Z 2025-05-07T20:26:03.3527220Z 2025-05-07T20:26:03.3607016Z libnpp-12.3.3.65 | 130.6 MB | #### | 40%  2025-05-07T20:26:03.3916943Z libcublas-12.8.3.14 | 460.2 MB | ####### | 70% 2025-05-07T20:26:03.3917288Z 2025-05-07T20:26:03.3917294Z 2025-05-07T20:26:03.3917299Z 2025-05-07T20:26:03.3917304Z 2025-05-07T20:26:03.3917309Z 2025-05-07T20:26:03.3917314Z 2025-05-07T20:26:03.3917331Z 2025-05-07T20:26:03.4374155Z cuda-nvvp-12.8.57 | 112.4 MB | ##6 | 27%  2025-05-07T20:26:03.4374561Z 2025-05-07T20:26:03.4374566Z 2025-05-07T20:26:03.4374601Z 2025-05-07T20:26:03.4374616Z 2025-05-07T20:26:03.4374622Z 2025-05-07T20:26:03.4378423Z 2025-05-07T20:26:03.4589121Z cuda-nsight-12.8.55 | 113.2 MB | ###7 | 38%  2025-05-07T20:26:03.4589530Z 
2025-05-07T20:26:03.4589545Z 2025-05-07T20:26:03.4589550Z 2025-05-07T20:26:03.4589556Z 2025-05-07T20:26:03.4591428Z 2025-05-07T20:26:03.4669126Z libnpp-12.3.3.65 | 130.6 MB | ####2 | 42%  2025-05-07T20:26:03.4921618Z libcublas-12.8.3.14 | 460.2 MB | ####### | 71% 2025-05-07T20:26:03.4921976Z 2025-05-07T20:26:03.4921982Z 2025-05-07T20:26:03.4921987Z 2025-05-07T20:26:03.4921992Z 2025-05-07T20:26:03.4921998Z 2025-05-07T20:26:03.4922003Z 2025-05-07T20:26:03.4922017Z 2025-05-07T20:26:03.5423736Z cuda-nvvp-12.8.57 | 112.4 MB | ##9 | 29%  2025-05-07T20:26:03.5424136Z 2025-05-07T20:26:03.5424142Z 2025-05-07T20:26:03.5424147Z 2025-05-07T20:26:03.5424152Z 2025-05-07T20:26:03.5424172Z 2025-05-07T20:26:03.5424219Z 2025-05-07T20:26:03.5606104Z cuda-nsight-12.8.55 | 113.2 MB | ###9 | 40%  2025-05-07T20:26:03.5606503Z 2025-05-07T20:26:03.5606508Z 2025-05-07T20:26:03.5606523Z 2025-05-07T20:26:03.5606528Z 2025-05-07T20:26:03.5613341Z 2025-05-07T20:26:03.5669691Z libnpp-12.3.3.65 | 130.6 MB | ####4 | 44%  2025-05-07T20:26:03.5923238Z libcublas-12.8.3.14 | 460.2 MB | #######1 | 71% 2025-05-07T20:26:03.5923590Z 2025-05-07T20:26:03.5923596Z 2025-05-07T20:26:03.5923601Z 2025-05-07T20:26:03.5923606Z 2025-05-07T20:26:03.5923611Z 2025-05-07T20:26:03.5923617Z 2025-05-07T20:26:03.5924386Z 2025-05-07T20:26:03.6424251Z cuda-nvvp-12.8.57 | 112.4 MB | ###1 | 31%  2025-05-07T20:26:03.6424666Z 2025-05-07T20:26:03.6424670Z 2025-05-07T20:26:03.6424674Z 2025-05-07T20:26:03.6424677Z 2025-05-07T20:26:03.6424681Z 2025-05-07T20:26:03.6429013Z 2025-05-07T20:26:03.6609603Z cuda-nsight-12.8.55 | 113.2 MB | ####2 | 42%  2025-05-07T20:26:03.6609939Z 2025-05-07T20:26:03.6609943Z 2025-05-07T20:26:03.6609947Z 2025-05-07T20:26:03.6609951Z 2025-05-07T20:26:03.6609954Z 2025-05-07T20:26:03.6753828Z libnpp-12.3.3.65 | 130.6 MB | ####6 | 46%  2025-05-07T20:26:03.6923828Z libcublas-12.8.3.14 | 460.2 MB | #######1 | 72% 2025-05-07T20:26:03.6924181Z 2025-05-07T20:26:03.6924432Z 2025-05-07T20:26:03.6924437Z 2025-05-07T20:26:03.6924441Z 2025-05-07T20:26:03.6924445Z 2025-05-07T20:26:03.6924449Z 2025-05-07T20:26:03.6924453Z 2025-05-07T20:26:03.7434007Z cuda-nvvp-12.8.57 | 112.4 MB | ###3 | 34%  2025-05-07T20:26:03.7434352Z 2025-05-07T20:26:03.7434356Z 2025-05-07T20:26:03.7434359Z 2025-05-07T20:26:03.7434363Z 2025-05-07T20:26:03.7434366Z 2025-05-07T20:26:03.7437132Z 2025-05-07T20:26:03.7613878Z cuda-nsight-12.8.55 | 113.2 MB | ####4 | 44%  2025-05-07T20:26:03.7614292Z 2025-05-07T20:26:03.7614298Z 2025-05-07T20:26:03.7614625Z 2025-05-07T20:26:03.7614630Z 2025-05-07T20:26:03.7616220Z 2025-05-07T20:26:03.7757529Z libnpp-12.3.3.65 | 130.6 MB | ####8 | 48%  2025-05-07T20:26:03.7926619Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 72% 2025-05-07T20:26:03.7926947Z 2025-05-07T20:26:03.7926951Z 2025-05-07T20:26:03.7926954Z 2025-05-07T20:26:03.7926958Z 2025-05-07T20:26:03.7926985Z 2025-05-07T20:26:03.7926989Z 2025-05-07T20:26:03.7932552Z 2025-05-07T20:26:03.8434197Z cuda-nvvp-12.8.57 | 112.4 MB | ###6 | 36%  2025-05-07T20:26:03.8434505Z 2025-05-07T20:26:03.8434509Z 2025-05-07T20:26:03.8434513Z 2025-05-07T20:26:03.8434517Z 2025-05-07T20:26:03.8434521Z 2025-05-07T20:26:03.8443201Z 2025-05-07T20:26:03.8615929Z cuda-nsight-12.8.55 | 113.2 MB | ####6 | 47%  2025-05-07T20:26:03.8616249Z 2025-05-07T20:26:03.8616253Z 2025-05-07T20:26:03.8616256Z 2025-05-07T20:26:03.8616260Z 2025-05-07T20:26:03.8625711Z 2025-05-07T20:26:03.8771102Z libnpp-12.3.3.65 | 130.6 MB | ##### | 50%  2025-05-07T20:26:03.8932161Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 73% 
2025-05-07T20:26:03.8932449Z 2025-05-07T20:26:03.8932453Z 2025-05-07T20:26:03.8932457Z 2025-05-07T20:26:03.8932461Z 2025-05-07T20:26:03.8932464Z 2025-05-07T20:26:03.8932469Z 2025-05-07T20:26:03.8937188Z 2025-05-07T20:26:03.9435140Z cuda-nvvp-12.8.57 | 112.4 MB | ###8 | 39%  2025-05-07T20:26:03.9435452Z 2025-05-07T20:26:03.9435457Z 2025-05-07T20:26:03.9435461Z 2025-05-07T20:26:03.9435465Z 2025-05-07T20:26:03.9435468Z 2025-05-07T20:26:03.9435472Z 2025-05-07T20:26:03.9616147Z cuda-nsight-12.8.55 | 113.2 MB | ####9 | 49%  2025-05-07T20:26:03.9616443Z 2025-05-07T20:26:03.9616447Z 2025-05-07T20:26:03.9616451Z 2025-05-07T20:26:03.9616455Z 2025-05-07T20:26:03.9618103Z 2025-05-07T20:26:03.9774828Z libnpp-12.3.3.65 | 130.6 MB | #####2 | 52%  2025-05-07T20:26:03.9935718Z libcublas-12.8.3.14 | 460.2 MB | #######3 | 74% 2025-05-07T20:26:03.9936002Z 2025-05-07T20:26:03.9936006Z 2025-05-07T20:26:03.9936010Z 2025-05-07T20:26:03.9936014Z 2025-05-07T20:26:03.9936018Z 2025-05-07T20:26:03.9936022Z 2025-05-07T20:26:03.9936231Z 2025-05-07T20:26:04.0461056Z cuda-nvvp-12.8.57 | 112.4 MB | ####1 | 41%  2025-05-07T20:26:04.0461397Z 2025-05-07T20:26:04.0461402Z 2025-05-07T20:26:04.0461406Z 2025-05-07T20:26:04.0461410Z 2025-05-07T20:26:04.0461413Z 2025-05-07T20:26:04.0461417Z 2025-05-07T20:26:04.0657578Z cuda-nsight-12.8.55 | 113.2 MB | #####1 | 52%  2025-05-07T20:26:04.0657901Z 2025-05-07T20:26:04.0657905Z 2025-05-07T20:26:04.0657908Z 2025-05-07T20:26:04.0657912Z 2025-05-07T20:26:04.0659483Z 2025-05-07T20:26:04.0776491Z libnpp-12.3.3.65 | 130.6 MB | #####4 | 54%  2025-05-07T20:26:04.0936383Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 74% 2025-05-07T20:26:04.0936745Z 2025-05-07T20:26:04.0936783Z 2025-05-07T20:26:04.0936786Z 2025-05-07T20:26:04.0936790Z 2025-05-07T20:26:04.0936794Z 2025-05-07T20:26:04.0936798Z 2025-05-07T20:26:04.0939350Z 2025-05-07T20:26:04.1463714Z cuda-nvvp-12.8.57 | 112.4 MB | ####3 | 44%  2025-05-07T20:26:04.1464022Z 2025-05-07T20:26:04.1464027Z 2025-05-07T20:26:04.1464030Z 2025-05-07T20:26:04.1464034Z 2025-05-07T20:26:04.1464288Z 2025-05-07T20:26:04.1464294Z 2025-05-07T20:26:04.1666589Z cuda-nsight-12.8.55 | 113.2 MB | #####4 | 54%  2025-05-07T20:26:04.1666892Z 2025-05-07T20:26:04.1666896Z 2025-05-07T20:26:04.1666900Z 2025-05-07T20:26:04.1666904Z 2025-05-07T20:26:04.1675400Z 2025-05-07T20:26:04.1800472Z libnpp-12.3.3.65 | 130.6 MB | #####6 | 57%  2025-05-07T20:26:04.2077657Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 75% 2025-05-07T20:26:04.2078024Z 2025-05-07T20:26:04.2078030Z 2025-05-07T20:26:04.2078035Z 2025-05-07T20:26:04.2078040Z 2025-05-07T20:26:04.2078334Z 2025-05-07T20:26:04.2078339Z 2025-05-07T20:26:04.2078344Z 2025-05-07T20:26:04.2465988Z cuda-nvvp-12.8.57 | 112.4 MB | ####6 | 46%  2025-05-07T20:26:04.2466401Z 2025-05-07T20:26:04.2466407Z 2025-05-07T20:26:04.2466412Z 2025-05-07T20:26:04.2466418Z 2025-05-07T20:26:04.2466423Z 2025-05-07T20:26:04.2466428Z 2025-05-07T20:26:04.2668098Z cuda-nsight-12.8.55 | 113.2 MB | #####6 | 57%  2025-05-07T20:26:04.2668495Z 2025-05-07T20:26:04.2668501Z 2025-05-07T20:26:04.2668517Z 2025-05-07T20:26:04.2668523Z 2025-05-07T20:26:04.2670835Z 2025-05-07T20:26:04.2801317Z libnpp-12.3.3.65 | 130.6 MB | #####9 | 59%  2025-05-07T20:26:04.3083540Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 75% 2025-05-07T20:26:04.3083910Z 2025-05-07T20:26:04.3083916Z 2025-05-07T20:26:04.3083921Z 2025-05-07T20:26:04.3083926Z 2025-05-07T20:26:04.3083931Z 2025-05-07T20:26:04.3083936Z 2025-05-07T20:26:04.3086099Z 2025-05-07T20:26:04.3484490Z cuda-nvvp-12.8.57 | 112.4 MB | ####8 | 
49%  2025-05-07T20:26:04.3484895Z 2025-05-07T20:26:04.3484901Z 2025-05-07T20:26:04.3484906Z 2025-05-07T20:26:04.3484912Z 2025-05-07T20:26:04.3484926Z 2025-05-07T20:26:04.3490430Z 2025-05-07T20:26:04.3673702Z cuda-nsight-12.8.55 | 113.2 MB | #####9 | 59%  2025-05-07T20:26:04.3674105Z 2025-05-07T20:26:04.3674142Z 2025-05-07T20:26:04.3674161Z 2025-05-07T20:26:04.3674166Z 2025-05-07T20:26:04.3675386Z 2025-05-07T20:26:04.3876917Z libnpp-12.3.3.65 | 130.6 MB | ######1 | 61%  2025-05-07T20:26:04.4085199Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 76% 2025-05-07T20:26:04.4085547Z 2025-05-07T20:26:04.4085553Z 2025-05-07T20:26:04.4085558Z 2025-05-07T20:26:04.4085563Z 2025-05-07T20:26:04.4085568Z 2025-05-07T20:26:04.4085573Z 2025-05-07T20:26:04.4087645Z 2025-05-07T20:26:04.4550824Z cuda-nvvp-12.8.57 | 112.4 MB | #####1 | 51%  2025-05-07T20:26:04.4551180Z 2025-05-07T20:26:04.4551184Z 2025-05-07T20:26:04.4551188Z 2025-05-07T20:26:04.4551192Z 2025-05-07T20:26:04.4551195Z 2025-05-07T20:26:04.4551212Z 2025-05-07T20:26:04.4676624Z cuda-nsight-12.8.55 | 113.2 MB | ######1 | 61%  2025-05-07T20:26:04.4676984Z 2025-05-07T20:26:04.4676989Z 2025-05-07T20:26:04.4676992Z 2025-05-07T20:26:04.4677004Z 2025-05-07T20:26:04.4677028Z 2025-05-07T20:26:04.4879586Z libnpp-12.3.3.65 | 130.6 MB | ######3 | 63%  2025-05-07T20:26:04.5086316Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 77% 2025-05-07T20:26:04.5086570Z 2025-05-07T20:26:04.5086574Z 2025-05-07T20:26:04.5086578Z 2025-05-07T20:26:04.5086581Z 2025-05-07T20:26:04.5086585Z 2025-05-07T20:26:04.5086589Z 2025-05-07T20:26:04.5088027Z 2025-05-07T20:26:04.5593856Z cuda-nvvp-12.8.57 | 112.4 MB | #####3 | 54%  2025-05-07T20:26:04.5594154Z 2025-05-07T20:26:04.5594158Z 2025-05-07T20:26:04.5594162Z 2025-05-07T20:26:04.5594196Z 2025-05-07T20:26:04.5594200Z 2025-05-07T20:26:04.5594212Z 2025-05-07T20:26:04.5705631Z cuda-nsight-12.8.55 | 113.2 MB | ######3 | 64%  2025-05-07T20:26:04.5706041Z 2025-05-07T20:26:04.5706046Z 2025-05-07T20:26:04.5706052Z 2025-05-07T20:26:04.5706057Z 2025-05-07T20:26:04.5708251Z 2025-05-07T20:26:04.5973942Z libnpp-12.3.3.65 | 130.6 MB | ######5 | 66%  2025-05-07T20:26:04.6112600Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 77% 2025-05-07T20:26:04.6112939Z 2025-05-07T20:26:04.6112944Z 2025-05-07T20:26:04.6112949Z 2025-05-07T20:26:04.6112954Z 2025-05-07T20:26:04.6112960Z 2025-05-07T20:26:04.6112967Z 2025-05-07T20:26:04.6114302Z 2025-05-07T20:26:04.6596289Z cuda-nvvp-12.8.57 | 112.4 MB | #####6 | 56%  2025-05-07T20:26:04.6596686Z 2025-05-07T20:26:04.6596690Z 2025-05-07T20:26:04.6596693Z 2025-05-07T20:26:04.6596697Z 2025-05-07T20:26:04.6596700Z 2025-05-07T20:26:04.6599152Z 2025-05-07T20:26:04.6726272Z cuda-nsight-12.8.55 | 113.2 MB | ######6 | 66%  2025-05-07T20:26:04.6726866Z 2025-05-07T20:26:04.6726871Z 2025-05-07T20:26:04.6726874Z 2025-05-07T20:26:04.6726878Z 2025-05-07T20:26:04.6729533Z 2025-05-07T20:26:04.6992857Z libnpp-12.3.3.65 | 130.6 MB | ######7 | 68%  2025-05-07T20:26:04.7429801Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 78% 2025-05-07T20:26:04.7430075Z 2025-05-07T20:26:04.7430079Z 2025-05-07T20:26:04.7430083Z 2025-05-07T20:26:04.7430086Z 2025-05-07T20:26:04.7430090Z 2025-05-07T20:26:04.7430094Z 2025-05-07T20:26:04.7431226Z 2025-05-07T20:26:04.7596415Z cuda-nvvp-12.8.57 | 112.4 MB | #####8 | 59%  2025-05-07T20:26:04.7596823Z 2025-05-07T20:26:04.7596829Z 2025-05-07T20:26:04.7596834Z 2025-05-07T20:26:04.7596838Z 2025-05-07T20:26:04.7596842Z 2025-05-07T20:26:04.7599448Z 2025-05-07T20:26:04.7811767Z cuda-nsight-12.8.55 | 113.2 MB | ######8 | 68%  
2025-05-07T20:26:04.7812073Z 2025-05-07T20:26:04.7812109Z 2025-05-07T20:26:04.7812113Z 2025-05-07T20:26:04.7812117Z 2025-05-07T20:26:04.7814426Z 2025-05-07T20:26:04.7993541Z libnpp-12.3.3.65 | 130.6 MB | ######9 | 70%  2025-05-07T20:26:04.8483629Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 79% 2025-05-07T20:26:04.8483909Z 2025-05-07T20:26:04.8483913Z 2025-05-07T20:26:04.8483916Z 2025-05-07T20:26:04.8483950Z 2025-05-07T20:26:04.8483954Z 2025-05-07T20:26:04.8483958Z 2025-05-07T20:26:04.8483970Z 2025-05-07T20:26:04.8606107Z cuda-nvvp-12.8.57 | 112.4 MB | ######1 | 61%  2025-05-07T20:26:04.8606403Z 2025-05-07T20:26:04.8606407Z 2025-05-07T20:26:04.8606410Z 2025-05-07T20:26:04.8606414Z 2025-05-07T20:26:04.8606426Z 2025-05-07T20:26:04.8608544Z 2025-05-07T20:26:04.8997155Z cuda-nsight-12.8.55 | 113.2 MB | ####### | 71%  2025-05-07T20:26:04.9486893Z libcublas-12.8.3.14 | 460.2 MB | #######9 | 79% 2025-05-07T20:26:04.9487242Z 2025-05-07T20:26:04.9487275Z 2025-05-07T20:26:04.9487278Z 2025-05-07T20:26:04.9487282Z 2025-05-07T20:26:04.9487285Z 2025-05-07T20:26:04.9487289Z 2025-05-07T20:26:04.9487293Z 2025-05-07T20:26:04.9612245Z cuda-nvvp-12.8.57 | 112.4 MB | ######3 | 63%  2025-05-07T20:26:04.9612645Z 2025-05-07T20:26:04.9612649Z 2025-05-07T20:26:04.9612652Z 2025-05-07T20:26:04.9612656Z 2025-05-07T20:26:04.9612678Z 2025-05-07T20:26:04.9612691Z 2025-05-07T20:26:05.0002653Z cuda-nsight-12.8.55 | 113.2 MB | #######3 | 73%  2025-05-07T20:26:05.0071490Z libcublas-12.8.3.14 | 460.2 MB | #######9 | 80% 2025-05-07T20:26:05.0071830Z 2025-05-07T20:26:05.0071834Z 2025-05-07T20:26:05.0071838Z 2025-05-07T20:26:05.0071842Z 2025-05-07T20:26:05.0074015Z 2025-05-07T20:26:05.0488303Z libnpp-12.3.3.65 | 130.6 MB | #######2 | 72%  2025-05-07T20:26:05.0488616Z 2025-05-07T20:26:05.0488620Z 2025-05-07T20:26:05.0488624Z 2025-05-07T20:26:05.0488627Z 2025-05-07T20:26:05.0488663Z 2025-05-07T20:26:05.0488666Z 2025-05-07T20:26:05.0489322Z 2025-05-07T20:26:05.0615196Z cuda-nvvp-12.8.57 | 112.4 MB | ######5 | 66%  2025-05-07T20:26:05.0615628Z 2025-05-07T20:26:05.0615634Z 2025-05-07T20:26:05.0615639Z 2025-05-07T20:26:05.0615644Z 2025-05-07T20:26:05.0615649Z 2025-05-07T20:26:05.0615654Z 2025-05-07T20:26:05.1069222Z cuda-nsight-12.8.55 | 113.2 MB | #######5 | 76%  2025-05-07T20:26:05.1079756Z libcublas-12.8.3.14 | 460.2 MB | ######## | 80% 2025-05-07T20:26:05.1080137Z 2025-05-07T20:26:05.1080144Z 2025-05-07T20:26:05.1080150Z 2025-05-07T20:26:05.1080155Z 2025-05-07T20:26:05.1080160Z 2025-05-07T20:26:05.1596962Z libnpp-12.3.3.65 | 130.6 MB | #######3 | 74%  2025-05-07T20:26:05.1597371Z 2025-05-07T20:26:05.1597376Z 2025-05-07T20:26:05.1597381Z 2025-05-07T20:26:05.1597386Z 2025-05-07T20:26:05.1597391Z 2025-05-07T20:26:05.1597396Z 2025-05-07T20:26:05.1598085Z 2025-05-07T20:26:05.1616977Z cuda-nvvp-12.8.57 | 112.4 MB | ######8 | 68%  2025-05-07T20:26:05.1617693Z 2025-05-07T20:26:05.1617701Z 2025-05-07T20:26:05.1617706Z 2025-05-07T20:26:05.1617711Z 2025-05-07T20:26:05.1617715Z 2025-05-07T20:26:05.1617722Z 2025-05-07T20:26:05.2084605Z cuda-nsight-12.8.55 | 113.2 MB | #######8 | 78%  2025-05-07T20:26:05.2085027Z 2025-05-07T20:26:05.2085058Z 2025-05-07T20:26:05.2085062Z 2025-05-07T20:26:05.2085066Z 2025-05-07T20:26:05.2088061Z 2025-05-07T20:26:05.2254518Z libnpp-12.3.3.65 | 130.6 MB | #######5 | 76%  2025-05-07T20:26:05.2601881Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 81% 2025-05-07T20:26:05.2602186Z 2025-05-07T20:26:05.2602191Z 2025-05-07T20:26:05.2602195Z 2025-05-07T20:26:05.2602200Z 2025-05-07T20:26:05.2602203Z 
2025-05-07T20:26:05.2602208Z 2025-05-07T20:26:05.2602212Z 2025-05-07T20:26:05.2701891Z cuda-nvvp-12.8.57 | 112.4 MB | ####### | 70%  2025-05-07T20:26:05.2702206Z 2025-05-07T20:26:05.2702211Z 2025-05-07T20:26:05.2702214Z 2025-05-07T20:26:05.2702218Z 2025-05-07T20:26:05.2702222Z 2025-05-07T20:26:05.2705616Z 2025-05-07T20:26:05.3095816Z cuda-nsight-12.8.55 | 113.2 MB | ######## | 81%  2025-05-07T20:26:05.3096127Z 2025-05-07T20:26:05.3096131Z 2025-05-07T20:26:05.3096135Z 2025-05-07T20:26:05.3096138Z 2025-05-07T20:26:05.3100320Z 2025-05-07T20:26:05.3263234Z libnpp-12.3.3.65 | 130.6 MB | #######7 | 78%  2025-05-07T20:26:05.3603774Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 82% 2025-05-07T20:26:05.3604088Z 2025-05-07T20:26:05.3604094Z 2025-05-07T20:26:05.3604099Z 2025-05-07T20:26:05.3604104Z 2025-05-07T20:26:05.3604110Z 2025-05-07T20:26:05.3604115Z 2025-05-07T20:26:05.3609957Z 2025-05-07T20:26:05.3744303Z cuda-nvvp-12.8.57 | 112.4 MB | #######2 | 73%  2025-05-07T20:26:05.3744747Z 2025-05-07T20:26:05.3744753Z 2025-05-07T20:26:05.3744758Z 2025-05-07T20:26:05.3744764Z 2025-05-07T20:26:05.3744804Z 2025-05-07T20:26:05.3744810Z 2025-05-07T20:26:05.4098539Z cuda-nsight-12.8.55 | 113.2 MB | ########3 | 83%  2025-05-07T20:26:05.4098978Z 2025-05-07T20:26:05.4098984Z 2025-05-07T20:26:05.4098990Z 2025-05-07T20:26:05.4098995Z 2025-05-07T20:26:05.4099738Z 2025-05-07T20:26:05.4345240Z libnpp-12.3.3.65 | 130.6 MB | #######9 | 80%  2025-05-07T20:26:05.4704515Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 82% 2025-05-07T20:26:05.4704894Z 2025-05-07T20:26:05.4704900Z 2025-05-07T20:26:05.4704905Z 2025-05-07T20:26:05.4704910Z 2025-05-07T20:26:05.4704915Z 2025-05-07T20:26:05.4704921Z 2025-05-07T20:26:05.4708444Z 2025-05-07T20:26:05.4752039Z cuda-nvvp-12.8.57 | 112.4 MB | #######5 | 75%  2025-05-07T20:26:05.4752328Z 2025-05-07T20:26:05.4752332Z 2025-05-07T20:26:05.4752335Z 2025-05-07T20:26:05.4752339Z 2025-05-07T20:26:05.4752343Z 2025-05-07T20:26:05.4752346Z 2025-05-07T20:26:05.5110597Z cuda-nsight-12.8.55 | 113.2 MB | ########5 | 86%  2025-05-07T20:26:05.5111232Z 2025-05-07T20:26:05.5111236Z 2025-05-07T20:26:05.5111239Z 2025-05-07T20:26:05.5111243Z 2025-05-07T20:26:05.5111732Z 2025-05-07T20:26:05.5491466Z libnpp-12.3.3.65 | 130.6 MB | ########1 | 81%  2025-05-07T20:26:05.5727567Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 83% 2025-05-07T20:26:05.5728555Z 2025-05-07T20:26:05.5728567Z 2025-05-07T20:26:05.5728570Z 2025-05-07T20:26:05.5728574Z 2025-05-07T20:26:05.5728578Z 2025-05-07T20:26:05.5728582Z 2025-05-07T20:26:05.5728586Z 2025-05-07T20:26:05.5780270Z cuda-nvvp-12.8.57 | 112.4 MB | #######7 | 78%  2025-05-07T20:26:05.5780607Z 2025-05-07T20:26:05.5780611Z 2025-05-07T20:26:05.5780614Z 2025-05-07T20:26:05.5780618Z 2025-05-07T20:26:05.5780621Z 2025-05-07T20:26:05.5781700Z 2025-05-07T20:26:05.6114695Z cuda-nsight-12.8.55 | 113.2 MB | ########7 | 88%  2025-05-07T20:26:05.6115111Z 2025-05-07T20:26:05.6115425Z 2025-05-07T20:26:05.6115430Z 2025-05-07T20:26:05.6115435Z 2025-05-07T20:26:05.6117151Z 2025-05-07T20:26:05.6599334Z libnpp-12.3.3.65 | 130.6 MB | ########3 | 83%  2025-05-07T20:26:05.6727515Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 83% 2025-05-07T20:26:05.6727786Z 2025-05-07T20:26:05.6727790Z 2025-05-07T20:26:05.6727794Z 2025-05-07T20:26:05.6727827Z 2025-05-07T20:26:05.6727831Z 2025-05-07T20:26:05.6727835Z 2025-05-07T20:26:05.6727839Z 2025-05-07T20:26:05.6841345Z cuda-nvvp-12.8.57 | 112.4 MB | ######## | 80%  2025-05-07T20:26:05.6841639Z 2025-05-07T20:26:05.6841643Z 2025-05-07T20:26:05.6841647Z 
2025-05-07T20:26:05.6841651Z 2025-05-07T20:26:05.6841655Z 2025-05-07T20:26:05.6841658Z 2025-05-07T20:26:05.7120822Z cuda-nsight-12.8.55 | 113.2 MB | ######### | 90%  2025-05-07T20:26:05.7121144Z 2025-05-07T20:26:05.7121148Z 2025-05-07T20:26:05.7121151Z 2025-05-07T20:26:05.7121155Z 2025-05-07T20:26:05.7122813Z 2025-05-07T20:26:05.7602870Z libnpp-12.3.3.65 | 130.6 MB | ########5 | 85%  2025-05-07T20:26:05.7762480Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 84% 2025-05-07T20:26:05.7762846Z 2025-05-07T20:26:05.7762852Z 2025-05-07T20:26:05.7762857Z 2025-05-07T20:26:05.7762862Z 2025-05-07T20:26:05.7762868Z 2025-05-07T20:26:05.7762874Z 2025-05-07T20:26:05.7762899Z 2025-05-07T20:26:05.7952999Z cuda-nvvp-12.8.57 | 112.4 MB | ########2 | 83%  2025-05-07T20:26:05.7953301Z 2025-05-07T20:26:05.7953305Z 2025-05-07T20:26:05.7953309Z 2025-05-07T20:26:05.7953312Z 2025-05-07T20:26:05.7953316Z 2025-05-07T20:26:05.7953320Z 2025-05-07T20:26:05.8126644Z cuda-nsight-12.8.55 | 113.2 MB | #########2 | 93%  2025-05-07T20:26:05.8126977Z 2025-05-07T20:26:05.8126981Z 2025-05-07T20:26:05.8126984Z 2025-05-07T20:26:05.8126988Z 2025-05-07T20:26:05.8131121Z 2025-05-07T20:26:05.8721667Z libnpp-12.3.3.65 | 130.6 MB | ########7 | 87%  2025-05-07T20:26:05.8788547Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 84% 2025-05-07T20:26:05.8788807Z 2025-05-07T20:26:05.8788812Z 2025-05-07T20:26:05.8788815Z 2025-05-07T20:26:05.8788819Z 2025-05-07T20:26:05.8788823Z 2025-05-07T20:26:05.8788827Z 2025-05-07T20:26:05.8791532Z 2025-05-07T20:26:05.8954254Z cuda-nvvp-12.8.57 | 112.4 MB | ########5 | 85%  2025-05-07T20:26:05.8954666Z 2025-05-07T20:26:05.8954672Z 2025-05-07T20:26:05.8954677Z 2025-05-07T20:26:05.8954682Z 2025-05-07T20:26:05.8954687Z 2025-05-07T20:26:05.8954693Z 2025-05-07T20:26:05.9193985Z cuda-nsight-12.8.55 | 113.2 MB | #########5 | 95%  2025-05-07T20:26:05.9194391Z 2025-05-07T20:26:05.9194397Z 2025-05-07T20:26:05.9194402Z 2025-05-07T20:26:05.9194407Z 2025-05-07T20:26:05.9202506Z 2025-05-07T20:26:05.9816292Z libnpp-12.3.3.65 | 130.6 MB | ########9 | 89%  2025-05-07T20:26:05.9947071Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 85% 2025-05-07T20:26:05.9947494Z 2025-05-07T20:26:05.9947500Z 2025-05-07T20:26:05.9947506Z 2025-05-07T20:26:05.9947511Z 2025-05-07T20:26:05.9947516Z 2025-05-07T20:26:05.9947521Z 2025-05-07T20:26:05.9953582Z 2025-05-07T20:26:05.9968978Z cuda-nvvp-12.8.57 | 112.4 MB | ########7 | 87%  2025-05-07T20:26:05.9969282Z 2025-05-07T20:26:05.9969286Z 2025-05-07T20:26:05.9969507Z 2025-05-07T20:26:05.9969512Z 2025-05-07T20:26:05.9969516Z 2025-05-07T20:26:05.9969520Z 2025-05-07T20:26:06.0200820Z cuda-nsight-12.8.55 | 113.2 MB | #########7 | 98%  2025-05-07T20:26:06.0201229Z 2025-05-07T20:26:06.0201233Z 2025-05-07T20:26:06.0201237Z 2025-05-07T20:26:06.0201240Z 2025-05-07T20:26:06.0204885Z 2025-05-07T20:26:06.0827636Z libnpp-12.3.3.65 | 130.6 MB | #########1 | 91%  2025-05-07T20:26:06.1003952Z libcublas-12.8.3.14 | 460.2 MB | ########5 | 86% 2025-05-07T20:26:06.1004340Z 2025-05-07T20:26:06.1004346Z 2025-05-07T20:26:06.1004351Z 2025-05-07T20:26:06.1004629Z 2025-05-07T20:26:06.1004634Z 2025-05-07T20:26:06.1004640Z 2025-05-07T20:26:06.1006556Z 2025-05-07T20:26:06.1206878Z cuda-nvvp-12.8.57 | 112.4 MB | ########9 | 90%  2025-05-07T20:26:06.1207280Z 2025-05-07T20:26:06.1207286Z 2025-05-07T20:26:06.1207291Z 2025-05-07T20:26:06.1207297Z 2025-05-07T20:26:06.1208662Z 2025-05-07T20:26:06.1831251Z libnpp-12.3.3.65 | 130.6 MB | #########3 | 93%  2025-05-07T20:26:06.2038062Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 86% 
2025-05-07T20:26:06.2038334Z 2025-05-07T20:26:06.2038338Z 2025-05-07T20:26:06.2038342Z 2025-05-07T20:26:06.2038345Z 2025-05-07T20:26:06.2038349Z 2025-05-07T20:26:06.2038352Z 2025-05-07T20:26:06.2038357Z 2025-05-07T20:26:06.2210673Z cuda-nvvp-12.8.57 | 112.4 MB | #########1 | 92%  2025-05-07T20:26:06.2211100Z 2025-05-07T20:26:06.2211107Z 2025-05-07T20:26:06.2211113Z 2025-05-07T20:26:06.2211118Z 2025-05-07T20:26:06.2214443Z 2025-05-07T20:26:06.2841692Z libnpp-12.3.3.65 | 130.6 MB | #########5 | 96%  2025-05-07T20:26:06.3041925Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 87% 2025-05-07T20:26:06.3042189Z 2025-05-07T20:26:06.3042193Z 2025-05-07T20:26:06.3042197Z 2025-05-07T20:26:06.3042200Z 2025-05-07T20:26:06.3042204Z 2025-05-07T20:26:06.3042208Z 2025-05-07T20:26:06.3044948Z 2025-05-07T20:26:06.3212768Z cuda-nvvp-12.8.57 | 112.4 MB | #########4 | 94%  2025-05-07T20:26:06.3213173Z 2025-05-07T20:26:06.3213178Z 2025-05-07T20:26:06.3213183Z 2025-05-07T20:26:06.3213187Z 2025-05-07T20:26:06.3213192Z 2025-05-07T20:26:06.3856581Z libnpp-12.3.3.65 | 130.6 MB | #########7 | 98%  2025-05-07T20:26:06.4047181Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 87% 2025-05-07T20:26:06.4047542Z 2025-05-07T20:26:06.4047548Z 2025-05-07T20:26:06.4047553Z 2025-05-07T20:26:06.4047558Z 2025-05-07T20:26:06.4047564Z 2025-05-07T20:26:06.4047569Z 2025-05-07T20:26:06.4047574Z 2025-05-07T20:26:06.4865894Z cuda-nvvp-12.8.57 | 112.4 MB | #########6 | 97%  2025-05-07T20:26:06.5048766Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 88% 2025-05-07T20:26:06.5049041Z 2025-05-07T20:26:06.5049047Z 2025-05-07T20:26:06.5049052Z 2025-05-07T20:26:06.5049058Z 2025-05-07T20:26:06.5049063Z 2025-05-07T20:26:06.5049068Z 2025-05-07T20:26:06.5055414Z 2025-05-07T20:26:06.5866912Z cuda-nvvp-12.8.57 | 112.4 MB | #########9 | 99%  2025-05-07T20:26:06.6866414Z libcublas-12.8.3.14 | 460.2 MB | ########8 | 89% 2025-05-07T20:26:06.7867110Z libcublas-12.8.3.14 | 460.2 MB | ########9 | 89% 2025-05-07T20:26:06.8868362Z libcublas-12.8.3.14 | 460.2 MB | ######### | 90% 2025-05-07T20:26:06.9942821Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 91% 2025-05-07T20:26:07.0943060Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 92% 2025-05-07T20:26:07.1950286Z libcublas-12.8.3.14 | 460.2 MB | #########2 | 93% 2025-05-07T20:26:07.2969412Z libcublas-12.8.3.14 | 460.2 MB | #########3 | 93% 2025-05-07T20:26:07.3977261Z libcublas-12.8.3.14 | 460.2 MB | #########4 | 94% 2025-05-07T20:26:07.5000127Z libcublas-12.8.3.14 | 460.2 MB | #########5 | 95% 2025-05-07T20:26:07.6001318Z libcublas-12.8.3.14 | 460.2 MB | #########5 | 96% 2025-05-07T20:26:07.7049146Z libcublas-12.8.3.14 | 460.2 MB | #########6 | 97% 2025-05-07T20:26:07.8092648Z libcublas-12.8.3.14 | 460.2 MB | #########7 | 97% 2025-05-07T20:26:07.9096326Z libcublas-12.8.3.14 | 460.2 MB | #########8 | 98% 2025-05-07T20:26:08.0097614Z libcublas-12.8.3.14 | 460.2 MB | #########9 | 99% 2025-05-07T20:26:09.7613162Z libcublas-12.8.3.14 | 460.2 MB | #########9 | 100% 2025-05-07T20:26:09.7613435Z 2025-05-07T20:26:09.7613646Z 2025-05-07T20:26:09.7613659Z 2025-05-07T20:26:09.7613767Z 2025-05-07T20:26:09.7613779Z 2025-05-07T20:26:09.7616368Z 2025-05-07T20:26:09.8263133Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%  2025-05-07T20:26:09.8263892Z 2025-05-07T20:26:09.8263900Z 2025-05-07T20:26:09.8263906Z 2025-05-07T20:26:09.8263912Z 2025-05-07T20:26:09.8263919Z 2025-05-07T20:26:09.8263925Z 2025-05-07T20:26:09.8263932Z 2025-05-07T20:26:09.8263938Z 2025-05-07T20:26:09.9277502Z cuda-nvrtc-12.8.61 | 63.1 MB | | 0%  
2025-05-07T20:26:09.9277890Z 2025-05-07T20:26:09.9277894Z 2025-05-07T20:26:09.9277926Z 2025-05-07T20:26:09.9277930Z 2025-05-07T20:26:09.9277934Z 2025-05-07T20:26:09.9277938Z 2025-05-07T20:26:09.9277942Z 2025-05-07T20:26:09.9282025Z 2025-05-07T20:26:10.0290591Z cuda-nvrtc-12.8.61 | 63.1 MB | 5 | 6%  2025-05-07T20:26:10.0291038Z 2025-05-07T20:26:10.0291044Z 2025-05-07T20:26:10.0291049Z 2025-05-07T20:26:10.0291055Z 2025-05-07T20:26:10.0291061Z 2025-05-07T20:26:10.0291068Z 2025-05-07T20:26:10.0291073Z 2025-05-07T20:26:10.0292603Z 2025-05-07T20:26:10.1244447Z cuda-nvrtc-12.8.61 | 63.1 MB | #1 | 11%  2025-05-07T20:26:10.1244885Z 2025-05-07T20:26:10.1244889Z 2025-05-07T20:26:10.1244893Z 2025-05-07T20:26:10.1249755Z 2025-05-07T20:26:10.1290362Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:26:10.1290673Z 2025-05-07T20:26:10.1290677Z 2025-05-07T20:26:10.1290682Z 2025-05-07T20:26:10.1290687Z 2025-05-07T20:26:10.1290690Z 2025-05-07T20:26:10.1290694Z 2025-05-07T20:26:10.1290712Z 2025-05-07T20:26:10.1292545Z 2025-05-07T20:26:10.2296173Z cuda-nvrtc-12.8.61 | 63.1 MB | #7 | 17%  2025-05-07T20:26:10.2296481Z 2025-05-07T20:26:10.2299193Z 2025-05-07T20:26:10.2299271Z 2025-05-07T20:26:10.2299277Z 2025-05-07T20:26:10.2299283Z 2025-05-07T20:26:10.2299288Z 2025-05-07T20:26:10.2299294Z 2025-05-07T20:26:10.2299341Z 2025-05-07T20:26:10.3409176Z cuda-nvrtc-12.8.61 | 63.1 MB | ##2 | 23%  2025-05-07T20:26:10.3409583Z 2025-05-07T20:26:10.3409587Z 2025-05-07T20:26:10.3409591Z 2025-05-07T20:26:10.3409594Z 2025-05-07T20:26:10.3409623Z 2025-05-07T20:26:10.3409626Z 2025-05-07T20:26:10.3409630Z 2025-05-07T20:26:10.3410777Z 2025-05-07T20:26:10.4410547Z cuda-nvrtc-12.8.61 | 63.1 MB | ##8 | 28%  2025-05-07T20:26:10.4410861Z 2025-05-07T20:26:10.4410865Z 2025-05-07T20:26:10.4410869Z 2025-05-07T20:26:10.4410873Z 2025-05-07T20:26:10.4410876Z 2025-05-07T20:26:10.4410901Z 2025-05-07T20:26:10.4410905Z 2025-05-07T20:26:10.4412228Z 2025-05-07T20:26:10.4669696Z cuda-nvrtc-12.8.61 | 63.1 MB | ###3 | 34%  2025-05-07T20:26:10.4669993Z 2025-05-07T20:26:10.4669998Z 2025-05-07T20:26:10.4670001Z 2025-05-07T20:26:10.4670005Z 2025-05-07T20:26:10.4670009Z 2025-05-07T20:26:10.4670012Z 2025-05-07T20:26:10.4683265Z 2025-05-07T20:26:10.5324316Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%  2025-05-07T20:26:10.5324614Z 2025-05-07T20:26:10.5324618Z 2025-05-07T20:26:10.5324622Z 2025-05-07T20:26:10.5324626Z 2025-05-07T20:26:10.5324652Z 2025-05-07T20:26:10.5324656Z 2025-05-07T20:26:10.5324660Z 2025-05-07T20:26:10.5324672Z 2025-05-07T20:26:10.5326986Z 2025-05-07T20:26:10.5412036Z libcurand-10.3.9.55 | 43.6 MB | | 0%  2025-05-07T20:26:10.5412334Z 2025-05-07T20:26:10.5412346Z 2025-05-07T20:26:10.5412350Z 2025-05-07T20:26:10.5412354Z 2025-05-07T20:26:10.5412357Z 2025-05-07T20:26:10.5412606Z 2025-05-07T20:26:10.5412611Z 2025-05-07T20:26:10.5412617Z 2025-05-07T20:26:10.6333146Z cuda-nvrtc-12.8.61 | 63.1 MB | ###9 | 40%  2025-05-07T20:26:10.6333474Z 2025-05-07T20:26:10.6333478Z 2025-05-07T20:26:10.6333481Z 2025-05-07T20:26:10.6333485Z 2025-05-07T20:26:10.6333489Z 2025-05-07T20:26:10.6333492Z 2025-05-07T20:26:10.6333496Z 2025-05-07T20:26:10.6333500Z 2025-05-07T20:26:10.6333503Z 2025-05-07T20:26:10.6519213Z libcurand-10.3.9.55 | 43.6 MB | 6 | 6%  2025-05-07T20:26:10.6519667Z 2025-05-07T20:26:10.6519955Z 2025-05-07T20:26:10.6519958Z 2025-05-07T20:26:10.6519962Z 2025-05-07T20:26:10.6519966Z 2025-05-07T20:26:10.6519969Z 2025-05-07T20:26:10.6519973Z 2025-05-07T20:26:10.6521138Z 2025-05-07T20:26:10.7333033Z cuda-nvrtc-12.8.61 | 63.1 MB | ####5 | 
46%  2025-05-07T20:26:10.7333344Z 2025-05-07T20:26:10.7333348Z 2025-05-07T20:26:10.7333352Z 2025-05-07T20:26:10.7333377Z 2025-05-07T20:26:10.7333381Z 2025-05-07T20:26:10.7333384Z 2025-05-07T20:26:10.7333388Z 2025-05-07T20:26:10.7333392Z 2025-05-07T20:26:10.7333395Z 2025-05-07T20:26:10.7544968Z libcurand-10.3.9.55 | 43.6 MB | #2 | 12%  2025-05-07T20:26:10.7545271Z 2025-05-07T20:26:10.7545275Z 2025-05-07T20:26:10.7545279Z 2025-05-07T20:26:10.7545283Z 2025-05-07T20:26:10.7545287Z 2025-05-07T20:26:10.7545290Z 2025-05-07T20:26:10.7545294Z 2025-05-07T20:26:10.7545306Z 2025-05-07T20:26:10.8339437Z cuda-nvrtc-12.8.61 | 63.1 MB | #####1 | 51%  2025-05-07T20:26:10.8339870Z 2025-05-07T20:26:10.8339874Z 2025-05-07T20:26:10.8339878Z 2025-05-07T20:26:10.8339882Z 2025-05-07T20:26:10.8339886Z 2025-05-07T20:26:10.8339889Z 2025-05-07T20:26:10.8339893Z 2025-05-07T20:26:10.8339897Z 2025-05-07T20:26:10.8343687Z 2025-05-07T20:26:10.8609132Z libcurand-10.3.9.55 | 43.6 MB | #9 | 19%  2025-05-07T20:26:10.8609473Z 2025-05-07T20:26:10.8609477Z 2025-05-07T20:26:10.8609481Z 2025-05-07T20:26:10.8609492Z 2025-05-07T20:26:10.8609496Z 2025-05-07T20:26:10.8609500Z 2025-05-07T20:26:10.8609503Z 2025-05-07T20:26:10.8610246Z 2025-05-07T20:26:10.9346642Z cuda-nvrtc-12.8.61 | 63.1 MB | #####6 | 57%  2025-05-07T20:26:10.9347031Z 2025-05-07T20:26:10.9347037Z 2025-05-07T20:26:10.9347042Z 2025-05-07T20:26:10.9347047Z 2025-05-07T20:26:10.9347053Z 2025-05-07T20:26:10.9347058Z 2025-05-07T20:26:10.9347063Z 2025-05-07T20:26:10.9347072Z 2025-05-07T20:26:10.9347077Z 2025-05-07T20:26:10.9716832Z libcurand-10.3.9.55 | 43.6 MB | ##6 | 26%  2025-05-07T20:26:10.9717185Z 2025-05-07T20:26:10.9717189Z 2025-05-07T20:26:10.9717193Z 2025-05-07T20:26:10.9717196Z 2025-05-07T20:26:10.9717200Z 2025-05-07T20:26:10.9717204Z 2025-05-07T20:26:10.9717210Z 2025-05-07T20:26:10.9722227Z 2025-05-07T20:26:11.0363244Z cuda-nvrtc-12.8.61 | 63.1 MB | ######1 | 62%  2025-05-07T20:26:11.0363717Z 2025-05-07T20:26:11.0363723Z 2025-05-07T20:26:11.0363729Z 2025-05-07T20:26:11.0363734Z 2025-05-07T20:26:11.0365214Z 2025-05-07T20:26:11.0365620Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%  2025-05-07T20:26:11.0365889Z 2025-05-07T20:26:11.0365893Z 2025-05-07T20:26:11.0365897Z 2025-05-07T20:26:11.0365900Z 2025-05-07T20:26:11.0366028Z 2025-05-07T20:26:11.0410689Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%  2025-05-07T20:26:11.0410972Z 2025-05-07T20:26:11.0410976Z 2025-05-07T20:26:11.0410980Z 2025-05-07T20:26:11.0410996Z 2025-05-07T20:26:11.0411000Z 2025-05-07T20:26:11.0411003Z 2025-05-07T20:26:11.0411007Z 2025-05-07T20:26:11.0411011Z 2025-05-07T20:26:11.0416940Z 2025-05-07T20:26:11.0801991Z libcurand-10.3.9.55 | 43.6 MB | ###2 | 33%  2025-05-07T20:26:11.0802338Z 2025-05-07T20:26:11.0802342Z 2025-05-07T20:26:11.0802345Z 2025-05-07T20:26:11.0802589Z 2025-05-07T20:26:11.0802594Z 2025-05-07T20:26:11.0802598Z 2025-05-07T20:26:11.0802601Z 2025-05-07T20:26:11.0803215Z 2025-05-07T20:26:11.0891281Z cuda-nvrtc-12.8.61 | 63.1 MB | ######7 | 67%  2025-05-07T20:26:11.0891567Z 2025-05-07T20:26:11.0891571Z 2025-05-07T20:26:11.0891575Z 2025-05-07T20:26:11.0891579Z 2025-05-07T20:26:11.0891592Z 2025-05-07T20:26:11.0891596Z 2025-05-07T20:26:11.0891599Z 2025-05-07T20:26:11.0891603Z 2025-05-07T20:26:11.0891606Z 2025-05-07T20:26:11.0891610Z 2025-05-07T20:26:11.1419923Z gds-tools-1.13.0.11 | 37.9 MB | | 0%  2025-05-07T20:26:11.1420488Z 2025-05-07T20:26:11.1420492Z 2025-05-07T20:26:11.1420496Z 2025-05-07T20:26:11.1420499Z 2025-05-07T20:26:11.1420503Z 2025-05-07T20:26:11.1420506Z 
2025-05-07T20:26:11.1420510Z 2025-05-07T20:26:11.1420514Z 2025-05-07T20:26:11.1422860Z 2025-05-07T20:26:11.1892065Z libcurand-10.3.9.55 | 43.6 MB | ###9 | 39%  2025-05-07T20:26:11.1892395Z 2025-05-07T20:26:11.1892400Z 2025-05-07T20:26:11.1892403Z 2025-05-07T20:26:11.1892407Z 2025-05-07T20:26:11.1892411Z 2025-05-07T20:26:11.1892414Z 2025-05-07T20:26:11.1892418Z 2025-05-07T20:26:11.1893693Z 2025-05-07T20:26:11.1902162Z cuda-nvrtc-12.8.61 | 63.1 MB | #######2 | 72%  2025-05-07T20:26:11.1902441Z 2025-05-07T20:26:11.1902445Z 2025-05-07T20:26:11.1902449Z 2025-05-07T20:26:11.1902452Z 2025-05-07T20:26:11.1902456Z 2025-05-07T20:26:11.1902459Z 2025-05-07T20:26:11.1902463Z 2025-05-07T20:26:11.1902470Z 2025-05-07T20:26:11.1902580Z 2025-05-07T20:26:11.1904490Z 2025-05-07T20:26:11.2570318Z gds-tools-1.13.0.11 | 37.9 MB | 6 | 7%  2025-05-07T20:26:11.2570653Z 2025-05-07T20:26:11.2570659Z 2025-05-07T20:26:11.2570674Z 2025-05-07T20:26:11.2570681Z 2025-05-07T20:26:11.2570686Z 2025-05-07T20:26:11.2570691Z 2025-05-07T20:26:11.2570697Z 2025-05-07T20:26:11.2570704Z 2025-05-07T20:26:11.2574817Z 2025-05-07T20:26:11.2904027Z libcurand-10.3.9.55 | 43.6 MB | ####5 | 46%  2025-05-07T20:26:11.2904326Z 2025-05-07T20:26:11.2904330Z 2025-05-07T20:26:11.2904334Z 2025-05-07T20:26:11.2904338Z 2025-05-07T20:26:11.2904342Z 2025-05-07T20:26:11.2904345Z 2025-05-07T20:26:11.2904350Z 2025-05-07T20:26:11.2904353Z 2025-05-07T20:26:11.2904357Z 2025-05-07T20:26:11.2904402Z 2025-05-07T20:26:11.3080387Z gds-tools-1.13.0.11 | 37.9 MB | #4 | 14%  2025-05-07T20:26:11.3080766Z 2025-05-07T20:26:11.3080772Z 2025-05-07T20:26:11.3080777Z 2025-05-07T20:26:11.3080810Z 2025-05-07T20:26:11.3080816Z 2025-05-07T20:26:11.3080821Z 2025-05-07T20:26:11.3080826Z 2025-05-07T20:26:11.3080832Z 2025-05-07T20:26:11.3664517Z cuda-nvrtc-12.8.61 | 63.1 MB | #######7 | 77%  2025-05-07T20:26:11.3664831Z 2025-05-07T20:26:11.3664835Z 2025-05-07T20:26:11.3664840Z 2025-05-07T20:26:11.3664843Z 2025-05-07T20:26:11.3664847Z 2025-05-07T20:26:11.3664875Z 2025-05-07T20:26:11.3664879Z 2025-05-07T20:26:11.3664891Z 2025-05-07T20:26:11.3664894Z 2025-05-07T20:26:11.3906150Z libcurand-10.3.9.55 | 43.6 MB | #####2 | 52%  2025-05-07T20:26:11.3906489Z 2025-05-07T20:26:11.3906493Z 2025-05-07T20:26:11.3906505Z 2025-05-07T20:26:11.3906509Z 2025-05-07T20:26:11.3906512Z 2025-05-07T20:26:11.3906516Z 2025-05-07T20:26:11.3906519Z 2025-05-07T20:26:11.3906524Z 2025-05-07T20:26:11.3906527Z 2025-05-07T20:26:11.3907004Z 2025-05-07T20:26:11.4174819Z gds-tools-1.13.0.11 | 37.9 MB | ##1 | 21%  2025-05-07T20:26:11.4175266Z 2025-05-07T20:26:11.4175273Z 2025-05-07T20:26:11.4175279Z 2025-05-07T20:26:11.4175286Z 2025-05-07T20:26:11.4175292Z 2025-05-07T20:26:11.4175299Z 2025-05-07T20:26:11.4175305Z 2025-05-07T20:26:11.4180063Z 2025-05-07T20:26:11.4680285Z cuda-nvrtc-12.8.61 | 63.1 MB | ########1 | 82%  2025-05-07T20:26:11.4680613Z 2025-05-07T20:26:11.4680879Z 2025-05-07T20:26:11.4680888Z 2025-05-07T20:26:11.4680894Z 2025-05-07T20:26:11.4680899Z 2025-05-07T20:26:11.4680902Z 2025-05-07T20:26:11.4680906Z 2025-05-07T20:26:11.4680910Z 2025-05-07T20:26:11.4680913Z 2025-05-07T20:26:11.4953783Z libcurand-10.3.9.55 | 43.6 MB | #####8 | 58%  2025-05-07T20:26:11.4954088Z 2025-05-07T20:26:11.4954092Z 2025-05-07T20:26:11.4954096Z 2025-05-07T20:26:11.4954099Z 2025-05-07T20:26:11.4954103Z 2025-05-07T20:26:11.4954107Z 2025-05-07T20:26:11.4954110Z 2025-05-07T20:26:11.4954114Z 2025-05-07T20:26:11.4954117Z 2025-05-07T20:26:11.4954406Z 2025-05-07T20:26:11.5226633Z gds-tools-1.13.0.11 | 37.9 MB | ##8 | 29%  
2025-05-07T20:26:11.5227059Z 2025-05-07T20:26:11.5227065Z 2025-05-07T20:26:11.5227070Z 2025-05-07T20:26:11.5227074Z 2025-05-07T20:26:11.5227079Z 2025-05-07T20:26:11.5227094Z 2025-05-07T20:26:11.5227100Z 2025-05-07T20:26:11.5228849Z 2025-05-07T20:26:11.5683220Z cuda-nvrtc-12.8.61 | 63.1 MB | ########6 | 86%  2025-05-07T20:26:11.5683537Z 2025-05-07T20:26:11.5683541Z 2025-05-07T20:26:11.5683545Z 2025-05-07T20:26:11.5683548Z 2025-05-07T20:26:11.5683552Z 2025-05-07T20:26:11.5683555Z 2025-05-07T20:26:11.5683559Z 2025-05-07T20:26:11.5683563Z 2025-05-07T20:26:11.5684921Z 2025-05-07T20:26:11.5954158Z libcurand-10.3.9.55 | 43.6 MB | ######4 | 65%  2025-05-07T20:26:11.5954487Z 2025-05-07T20:26:11.5954492Z 2025-05-07T20:26:11.5954497Z 2025-05-07T20:26:11.5954502Z 2025-05-07T20:26:11.5954507Z 2025-05-07T20:26:11.5954543Z 2025-05-07T20:26:11.5954549Z 2025-05-07T20:26:11.5954554Z 2025-05-07T20:26:11.5954559Z 2025-05-07T20:26:11.5954565Z 2025-05-07T20:26:11.6227541Z gds-tools-1.13.0.11 | 37.9 MB | ###6 | 37%  2025-05-07T20:26:11.6227846Z 2025-05-07T20:26:11.6227851Z 2025-05-07T20:26:11.6227857Z 2025-05-07T20:26:11.6227862Z 2025-05-07T20:26:11.6227866Z 2025-05-07T20:26:11.6227897Z 2025-05-07T20:26:11.6227901Z 2025-05-07T20:26:11.6231517Z 2025-05-07T20:26:11.6696754Z cuda-nvrtc-12.8.61 | 63.1 MB | #########1 | 91%  2025-05-07T20:26:11.6697054Z 2025-05-07T20:26:11.6697066Z 2025-05-07T20:26:11.6697071Z 2025-05-07T20:26:11.6697074Z 2025-05-07T20:26:11.6697078Z 2025-05-07T20:26:11.6697082Z 2025-05-07T20:26:11.6697086Z 2025-05-07T20:26:11.6697089Z 2025-05-07T20:26:11.6697273Z 2025-05-07T20:26:11.7178901Z libcurand-10.3.9.55 | 43.6 MB | ####### | 71%  2025-05-07T20:26:11.7179349Z 2025-05-07T20:26:11.7179386Z 2025-05-07T20:26:11.7179393Z 2025-05-07T20:26:11.7179401Z 2025-05-07T20:26:11.7179408Z 2025-05-07T20:26:11.7179415Z 2025-05-07T20:26:11.7179422Z 2025-05-07T20:26:11.7179428Z 2025-05-07T20:26:11.7179434Z 2025-05-07T20:26:11.7181402Z 2025-05-07T20:26:11.7231608Z gds-tools-1.13.0.11 | 37.9 MB | ####4 | 45%  2025-05-07T20:26:11.7231906Z 2025-05-07T20:26:11.7231925Z 2025-05-07T20:26:11.7231930Z 2025-05-07T20:26:11.7231933Z 2025-05-07T20:26:11.7231937Z 2025-05-07T20:26:11.7231941Z 2025-05-07T20:26:11.7231945Z 2025-05-07T20:26:11.7231948Z 2025-05-07T20:26:11.7711783Z cuda-nvrtc-12.8.61 | 63.1 MB | #########5 | 96%  2025-05-07T20:26:11.7712083Z 2025-05-07T20:26:11.7712087Z 2025-05-07T20:26:11.7712091Z 2025-05-07T20:26:11.7712094Z 2025-05-07T20:26:11.7712108Z 2025-05-07T20:26:11.7712113Z 2025-05-07T20:26:11.7712117Z 2025-05-07T20:26:11.7712120Z 2025-05-07T20:26:11.7712124Z 2025-05-07T20:26:11.8181583Z libcurand-10.3.9.55 | 43.6 MB | #######7 | 77%  2025-05-07T20:26:11.8181930Z 2025-05-07T20:26:11.8181934Z 2025-05-07T20:26:11.8181938Z 2025-05-07T20:26:11.8181942Z 2025-05-07T20:26:11.8181946Z 2025-05-07T20:26:11.8181949Z 2025-05-07T20:26:11.8181953Z 2025-05-07T20:26:11.8181957Z 2025-05-07T20:26:11.8181960Z 2025-05-07T20:26:11.8183480Z 2025-05-07T20:26:11.8713187Z gds-tools-1.13.0.11 | 37.9 MB | #####1 | 52%  2025-05-07T20:26:11.8713613Z 2025-05-07T20:26:11.8713619Z 2025-05-07T20:26:11.8713625Z 2025-05-07T20:26:11.8713630Z 2025-05-07T20:26:11.8713635Z 2025-05-07T20:26:11.8713640Z 2025-05-07T20:26:11.8713645Z 2025-05-07T20:26:11.8713651Z 2025-05-07T20:26:11.8717610Z 2025-05-07T20:26:11.9182392Z libcurand-10.3.9.55 | 43.6 MB | ########4 | 84%  2025-05-07T20:26:11.9182703Z 2025-05-07T20:26:11.9182707Z 2025-05-07T20:26:11.9182711Z 2025-05-07T20:26:11.9182715Z 2025-05-07T20:26:11.9182718Z 2025-05-07T20:26:11.9182981Z 
2025-05-07T20:26:11.9182985Z 2025-05-07T20:26:11.9182988Z 2025-05-07T20:26:11.9182992Z 2025-05-07T20:26:11.9182996Z 2025-05-07T20:26:11.9719201Z gds-tools-1.13.0.11 | 37.9 MB | #####9 | 60%  2025-05-07T20:26:11.9719524Z 2025-05-07T20:26:11.9719528Z 2025-05-07T20:26:11.9719532Z 2025-05-07T20:26:11.9719537Z 2025-05-07T20:26:11.9719561Z 2025-05-07T20:26:11.9719573Z 2025-05-07T20:26:11.9719576Z 2025-05-07T20:26:11.9719580Z 2025-05-07T20:26:11.9719584Z 2025-05-07T20:26:12.0189327Z libcurand-10.3.9.55 | 43.6 MB | #########3 | 94%  2025-05-07T20:26:12.0189646Z 2025-05-07T20:26:12.0189650Z 2025-05-07T20:26:12.0189654Z 2025-05-07T20:26:12.0189657Z 2025-05-07T20:26:12.0189661Z 2025-05-07T20:26:12.0189666Z 2025-05-07T20:26:12.0189670Z 2025-05-07T20:26:12.0189674Z 2025-05-07T20:26:12.0189677Z 2025-05-07T20:26:12.0195815Z 2025-05-07T20:26:12.1189703Z gds-tools-1.13.0.11 | 37.9 MB | ######7 | 67%  2025-05-07T20:26:12.1190066Z 2025-05-07T20:26:12.1190070Z 2025-05-07T20:26:12.1190074Z 2025-05-07T20:26:12.1190078Z 2025-05-07T20:26:12.1190081Z 2025-05-07T20:26:12.1190085Z 2025-05-07T20:26:12.1190089Z 2025-05-07T20:26:12.1190092Z 2025-05-07T20:26:12.1190096Z 2025-05-07T20:26:12.1190100Z 2025-05-07T20:26:12.2190120Z gds-tools-1.13.0.11 | 37.9 MB | #######5 | 76%  2025-05-07T20:26:12.2190439Z 2025-05-07T20:26:12.2190444Z 2025-05-07T20:26:12.2190449Z 2025-05-07T20:26:12.2190453Z 2025-05-07T20:26:12.2190456Z 2025-05-07T20:26:12.2190461Z 2025-05-07T20:26:12.2190464Z 2025-05-07T20:26:12.2190468Z 2025-05-07T20:26:12.2190471Z 2025-05-07T20:26:12.2192213Z 2025-05-07T20:26:12.3196119Z gds-tools-1.13.0.11 | 37.9 MB | ########4 | 85%  2025-05-07T20:26:12.3196437Z 2025-05-07T20:26:12.3196441Z 2025-05-07T20:26:12.3196445Z 2025-05-07T20:26:12.3196449Z 2025-05-07T20:26:12.3196452Z 2025-05-07T20:26:12.3196456Z 2025-05-07T20:26:12.3196487Z 2025-05-07T20:26:12.3196490Z 2025-05-07T20:26:12.3196494Z 2025-05-07T20:26:12.3197161Z 2025-05-07T20:26:13.3794632Z gds-tools-1.13.0.11 | 37.9 MB | #########3 | 94%  2025-05-07T20:26:13.3794965Z 2025-05-07T20:26:13.3794969Z 2025-05-07T20:26:13.3795519Z 2025-05-07T20:26:13.4823710Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:26:13.4824139Z 2025-05-07T20:26:13.4824144Z 2025-05-07T20:26:13.4824150Z 2025-05-07T20:26:13.4824155Z 2025-05-07T20:26:13.4824160Z 2025-05-07T20:26:13.4824165Z 2025-05-07T20:26:13.4824169Z 2025-05-07T20:26:13.4824176Z 2025-05-07T20:26:13.4826525Z 2025-05-07T20:26:13.5333866Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%  2025-05-07T20:26:13.5334293Z 2025-05-07T20:26:13.5334299Z 2025-05-07T20:26:13.5334304Z 2025-05-07T20:26:13.5334309Z 2025-05-07T20:26:13.5334315Z 2025-05-07T20:26:13.5334320Z 2025-05-07T20:26:13.5334325Z 2025-05-07T20:26:13.5334353Z 2025-05-07T20:26:13.5334359Z 2025-05-07T20:26:13.5334364Z 2025-05-07T20:26:13.5334369Z 2025-05-07T20:26:13.5997079Z python-3.11.8 | 29.3 MB | | 0%  2025-05-07T20:26:13.5997482Z 2025-05-07T20:26:13.5997488Z 2025-05-07T20:26:13.5997493Z 2025-05-07T20:26:13.5997498Z 2025-05-07T20:26:13.5997503Z 2025-05-07T20:26:13.6002393Z 2025-05-07T20:26:13.6337601Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%  2025-05-07T20:26:13.6337976Z 2025-05-07T20:26:13.6337980Z 2025-05-07T20:26:13.6337984Z 2025-05-07T20:26:13.6337988Z 2025-05-07T20:26:13.6338001Z 2025-05-07T20:26:13.6338005Z 2025-05-07T20:26:13.6338009Z 2025-05-07T20:26:13.6338013Z 2025-05-07T20:26:13.6338016Z 2025-05-07T20:26:13.6338020Z 2025-05-07T20:26:13.6340222Z 2025-05-07T20:26:13.7130239Z python-3.11.8 | 29.3 MB | #1 | 12%  
2025-05-07T20:26:13.7130627Z 2025-05-07T20:26:13.7130631Z 2025-05-07T20:26:13.7130855Z 2025-05-07T20:26:13.7130858Z 2025-05-07T20:26:13.7130862Z 2025-05-07T20:26:13.7130866Z 2025-05-07T20:26:13.7130869Z 2025-05-07T20:26:13.7130873Z 2025-05-07T20:26:13.7130877Z 2025-05-07T20:26:13.7130880Z 2025-05-07T20:26:13.7344726Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%  2025-05-07T20:26:13.7345034Z 2025-05-07T20:26:13.7345052Z 2025-05-07T20:26:13.7345069Z 2025-05-07T20:26:13.7345075Z 2025-05-07T20:26:13.7345080Z 2025-05-07T20:26:13.7345085Z 2025-05-07T20:26:13.7345090Z 2025-05-07T20:26:13.7345096Z 2025-05-07T20:26:13.7345100Z 2025-05-07T20:26:13.7345105Z 2025-05-07T20:26:13.7345118Z 2025-05-07T20:26:13.7877907Z python-3.11.8 | 29.3 MB | ##3 | 24%  2025-05-07T20:26:13.7878239Z 2025-05-07T20:26:13.7878243Z 2025-05-07T20:26:13.7878257Z 2025-05-07T20:26:13.7878261Z 2025-05-07T20:26:13.7878264Z 2025-05-07T20:26:13.7878268Z 2025-05-07T20:26:13.7878271Z 2025-05-07T20:26:13.7878275Z 2025-05-07T20:26:13.7878289Z 2025-05-07T20:26:13.7878292Z 2025-05-07T20:26:13.7878296Z 2025-05-07T20:26:13.7884993Z 2025-05-07T20:26:13.8335541Z libnvjitlink-12.8.61 | 28.7 MB | | 0%  2025-05-07T20:26:13.8336132Z 2025-05-07T20:26:13.8336137Z 2025-05-07T20:26:13.8336141Z 2025-05-07T20:26:13.8336146Z 2025-05-07T20:26:13.8336150Z 2025-05-07T20:26:13.8336168Z 2025-05-07T20:26:13.8336172Z 2025-05-07T20:26:13.8336177Z 2025-05-07T20:26:13.8349459Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%  2025-05-07T20:26:13.8349845Z 2025-05-07T20:26:13.8349849Z 2025-05-07T20:26:13.8349853Z 2025-05-07T20:26:13.8349857Z 2025-05-07T20:26:13.8349861Z 2025-05-07T20:26:13.8349864Z 2025-05-07T20:26:13.8349868Z 2025-05-07T20:26:13.8349872Z 2025-05-07T20:26:13.8349883Z 2025-05-07T20:26:13.8349886Z 2025-05-07T20:26:13.8349890Z 2025-05-07T20:26:13.8884191Z python-3.11.8 | 29.3 MB | ###5 | 36%  2025-05-07T20:26:13.8884586Z 2025-05-07T20:26:13.8884602Z 2025-05-07T20:26:13.8884607Z 2025-05-07T20:26:13.8884612Z 2025-05-07T20:26:13.8884617Z 2025-05-07T20:26:13.8884622Z 2025-05-07T20:26:13.8884628Z 2025-05-07T20:26:13.8884632Z 2025-05-07T20:26:13.8884637Z 2025-05-07T20:26:13.8884642Z 2025-05-07T20:26:13.8884647Z 2025-05-07T20:26:13.8884653Z 2025-05-07T20:26:13.9018959Z libnvjitlink-12.8.61 | 28.7 MB | # | 11%  2025-05-07T20:26:13.9019375Z 2025-05-07T20:26:13.9019381Z 2025-05-07T20:26:13.9019386Z 2025-05-07T20:26:13.9019391Z 2025-05-07T20:26:13.9019397Z 2025-05-07T20:26:13.9019402Z 2025-05-07T20:26:13.9019407Z 2025-05-07T20:26:13.9019412Z 2025-05-07T20:26:13.9019417Z 2025-05-07T20:26:13.9019423Z 2025-05-07T20:26:13.9019428Z 2025-05-07T20:26:13.9019433Z 2025-05-07T20:26:13.9024219Z 2025-05-07T20:26:13.9422132Z cuda-nvcc-tools-12.8 | 24.5 MB | | 0%  2025-05-07T20:26:13.9422475Z 2025-05-07T20:26:13.9422490Z 2025-05-07T20:26:13.9422494Z 2025-05-07T20:26:13.9422498Z 2025-05-07T20:26:13.9422501Z 2025-05-07T20:26:13.9422505Z 2025-05-07T20:26:13.9422515Z 2025-05-07T20:26:13.9422519Z 2025-05-07T20:26:13.9422522Z 2025-05-07T20:26:13.9422526Z 2025-05-07T20:26:13.9425542Z 2025-05-07T20:26:13.9884822Z python-3.11.8 | 29.3 MB | ####7 | 48%  2025-05-07T20:26:13.9885232Z 2025-05-07T20:26:13.9885238Z 2025-05-07T20:26:13.9885243Z 2025-05-07T20:26:13.9885248Z 2025-05-07T20:26:13.9885253Z 2025-05-07T20:26:13.9885259Z 2025-05-07T20:26:13.9885264Z 2025-05-07T20:26:13.9885269Z 2025-05-07T20:26:13.9885274Z 2025-05-07T20:26:13.9885279Z 2025-05-07T20:26:13.9885284Z 2025-05-07T20:26:13.9886893Z 2025-05-07T20:26:14.0024730Z libnvjitlink-12.8.61 | 28.7 MB | ##1 | 21%  
2025-05-07T20:26:14.0025156Z 2025-05-07T20:26:14.0025162Z 2025-05-07T20:26:14.0025167Z 2025-05-07T20:26:14.0025172Z 2025-05-07T20:26:14.0025439Z 2025-05-07T20:26:14.0025445Z 2025-05-07T20:26:14.0025450Z 2025-05-07T20:26:14.0025455Z 2025-05-07T20:26:14.0025460Z 2025-05-07T20:26:14.0025465Z 2025-05-07T20:26:14.0025470Z 2025-05-07T20:26:14.0025475Z 2025-05-07T20:26:14.0026820Z 2025-05-07T20:26:14.0282482Z cuda-nvcc-tools-12.8 | 24.5 MB | #1 | 11%  2025-05-07T20:26:14.0287522Z 2025-05-07T20:26:14.0669417Z nsight-compute-2025. | 320.6 MB | ########## | 100%  2025-05-07T20:26:14.0669727Z 2025-05-07T20:26:14.0669734Z 2025-05-07T20:26:14.0669739Z 2025-05-07T20:26:14.0669744Z 2025-05-07T20:26:14.0669750Z 2025-05-07T20:26:14.0669755Z 2025-05-07T20:26:14.0669761Z 2025-05-07T20:26:14.0669766Z 2025-05-07T20:26:14.0669772Z 2025-05-07T20:26:14.0669777Z 2025-05-07T20:26:14.0669781Z 2025-05-07T20:26:14.0964134Z python-3.11.8 | 29.3 MB | #####9 | 59%  2025-05-07T20:26:14.0964452Z 2025-05-07T20:26:14.0964456Z 2025-05-07T20:26:14.0964493Z 2025-05-07T20:26:14.0964499Z 2025-05-07T20:26:14.0964504Z 2025-05-07T20:26:14.0964509Z 2025-05-07T20:26:14.0964515Z 2025-05-07T20:26:14.0964529Z 2025-05-07T20:26:14.0964534Z 2025-05-07T20:26:14.0964539Z 2025-05-07T20:26:14.0964544Z 2025-05-07T20:26:14.0964550Z 2025-05-07T20:26:14.0964555Z 2025-05-07T20:26:14.0964561Z 2025-05-07T20:26:14.1034166Z cuda-nvvm-tools-12.8 | 23.5 MB | | 0%  2025-05-07T20:26:14.1034500Z 2025-05-07T20:26:14.1034504Z 2025-05-07T20:26:14.1034508Z 2025-05-07T20:26:14.1034511Z 2025-05-07T20:26:14.1034515Z 2025-05-07T20:26:14.1034518Z 2025-05-07T20:26:14.1034522Z 2025-05-07T20:26:14.1034525Z 2025-05-07T20:26:14.1034529Z 2025-05-07T20:26:14.1034532Z 2025-05-07T20:26:14.1034536Z 2025-05-07T20:26:14.1034539Z 2025-05-07T20:26:14.1034543Z 2025-05-07T20:26:14.1043760Z cuda-nvcc-tools-12.8 | 24.5 MB | ##3 | 24%  2025-05-07T20:26:14.1044225Z 2025-05-07T20:26:14.1044246Z 2025-05-07T20:26:14.1044251Z 2025-05-07T20:26:14.1044256Z 2025-05-07T20:26:14.1044261Z 2025-05-07T20:26:14.1044266Z 2025-05-07T20:26:14.1044271Z 2025-05-07T20:26:14.1044276Z 2025-05-07T20:26:14.1044280Z 2025-05-07T20:26:14.1044284Z 2025-05-07T20:26:14.1044287Z 2025-05-07T20:26:14.1051921Z 2025-05-07T20:26:14.1182260Z libnvjitlink-12.8.61 | 28.7 MB | ###1 | 32%  2025-05-07T20:26:14.1182586Z 2025-05-07T20:26:14.1182590Z 2025-05-07T20:26:14.1966723Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:26:14.1967015Z 2025-05-07T20:26:14.1967029Z 2025-05-07T20:26:14.1967032Z 2025-05-07T20:26:14.1967036Z 2025-05-07T20:26:14.1967039Z 2025-05-07T20:26:14.1967043Z 2025-05-07T20:26:14.1967055Z 2025-05-07T20:26:14.1967059Z 2025-05-07T20:26:14.1967062Z 2025-05-07T20:26:14.1967066Z 2025-05-07T20:26:14.1967069Z 2025-05-07T20:26:14.1967074Z 2025-05-07T20:26:14.1967078Z 2025-05-07T20:26:14.1967083Z 2025-05-07T20:26:14.2014065Z cuda-nvvm-tools-12.8 | 23.5 MB | # | 10%  2025-05-07T20:26:14.2014531Z 2025-05-07T20:26:14.2014537Z 2025-05-07T20:26:14.2014542Z 2025-05-07T20:26:14.2014547Z 2025-05-07T20:26:14.2014552Z 2025-05-07T20:26:14.2014559Z 2025-05-07T20:26:14.2014564Z 2025-05-07T20:26:14.2014569Z 2025-05-07T20:26:14.2014577Z 2025-05-07T20:26:14.2014868Z 2025-05-07T20:26:14.2016349Z 2025-05-07T20:26:14.2301436Z python-3.11.8 | 29.3 MB | ######9 | 70%  2025-05-07T20:26:14.2301742Z 2025-05-07T20:26:14.2301746Z 2025-05-07T20:26:14.2301749Z 2025-05-07T20:26:14.2301753Z 2025-05-07T20:26:14.2301757Z 2025-05-07T20:26:14.2301761Z 2025-05-07T20:26:14.2301764Z 2025-05-07T20:26:14.2301768Z 
2025-05-07T20:26:14.2301772Z 2025-05-07T20:26:14.2301775Z 2025-05-07T20:26:14.2301779Z 2025-05-07T20:26:14.2301786Z 2025-05-07T20:26:14.2305627Z 2025-05-07T20:26:14.2317079Z cuda-nvcc-tools-12.8 | 24.5 MB | ###5 | 35%  2025-05-07T20:26:14.2317852Z 2025-05-07T20:26:14.2317859Z 2025-05-07T20:26:14.2317876Z 2025-05-07T20:26:14.2317882Z 2025-05-07T20:26:14.2317888Z 2025-05-07T20:26:14.2317895Z 2025-05-07T20:26:14.2317901Z 2025-05-07T20:26:14.2317906Z 2025-05-07T20:26:14.2317911Z 2025-05-07T20:26:14.2317916Z 2025-05-07T20:26:14.2317921Z 2025-05-07T20:26:14.2317927Z 2025-05-07T20:26:14.2975593Z libnvjitlink-12.8.61 | 28.7 MB | ####1 | 42%  2025-05-07T20:26:14.2975942Z 2025-05-07T20:26:14.2975946Z 2025-05-07T20:26:14.2975950Z 2025-05-07T20:26:14.2975954Z 2025-05-07T20:26:14.2975957Z 2025-05-07T20:26:14.2975961Z 2025-05-07T20:26:14.2975965Z 2025-05-07T20:26:14.2975968Z 2025-05-07T20:26:14.2975972Z 2025-05-07T20:26:14.2975976Z 2025-05-07T20:26:14.2975979Z 2025-05-07T20:26:14.2975983Z 2025-05-07T20:26:14.2975987Z 2025-05-07T20:26:14.2975990Z 2025-05-07T20:26:14.3226649Z cuda-nvvm-tools-12.8 | 23.5 MB | ##1 | 21%  2025-05-07T20:26:14.3227022Z 2025-05-07T20:26:14.3227026Z 2025-05-07T20:26:14.3227030Z 2025-05-07T20:26:14.3227033Z 2025-05-07T20:26:14.3227037Z 2025-05-07T20:26:14.3227041Z 2025-05-07T20:26:14.3227044Z 2025-05-07T20:26:14.3227057Z 2025-05-07T20:26:14.3227061Z 2025-05-07T20:26:14.3227064Z 2025-05-07T20:26:14.3228738Z 2025-05-07T20:26:14.3308841Z python-3.11.8 | 29.3 MB | #######9 | 80%  2025-05-07T20:26:14.3309223Z 2025-05-07T20:26:14.3309228Z 2025-05-07T20:26:14.3309234Z 2025-05-07T20:26:14.3309253Z 2025-05-07T20:26:14.3309259Z 2025-05-07T20:26:14.3309264Z 2025-05-07T20:26:14.3309270Z 2025-05-07T20:26:14.3309275Z 2025-05-07T20:26:14.3309280Z 2025-05-07T20:26:14.3309285Z 2025-05-07T20:26:14.3309290Z 2025-05-07T20:26:14.3309295Z 2025-05-07T20:26:14.3309300Z 2025-05-07T20:26:14.3339748Z cuda-nvcc-tools-12.8 | 24.5 MB | ####6 | 46%  2025-05-07T20:26:14.3340152Z 2025-05-07T20:26:14.3340166Z 2025-05-07T20:26:14.3340170Z 2025-05-07T20:26:14.3340173Z 2025-05-07T20:26:14.3340177Z 2025-05-07T20:26:14.3340180Z 2025-05-07T20:26:14.3340184Z 2025-05-07T20:26:14.3340188Z 2025-05-07T20:26:14.3340191Z 2025-05-07T20:26:14.3340195Z 2025-05-07T20:26:14.3340198Z 2025-05-07T20:26:14.3340202Z 2025-05-07T20:26:14.3978769Z libnvjitlink-12.8.61 | 28.7 MB | ##### | 51%  2025-05-07T20:26:14.3979132Z 2025-05-07T20:26:14.3979138Z 2025-05-07T20:26:14.3979143Z 2025-05-07T20:26:14.3979148Z 2025-05-07T20:26:14.3979153Z 2025-05-07T20:26:14.3979158Z 2025-05-07T20:26:14.3979163Z 2025-05-07T20:26:14.3979168Z 2025-05-07T20:26:14.3979174Z 2025-05-07T20:26:14.3979179Z 2025-05-07T20:26:14.3979194Z 2025-05-07T20:26:14.3979200Z 2025-05-07T20:26:14.3979205Z 2025-05-07T20:26:14.3983357Z 2025-05-07T20:26:14.4343467Z cuda-nvvm-tools-12.8 | 23.5 MB | ###1 | 32%  2025-05-07T20:26:14.4343833Z 2025-05-07T20:26:14.4343864Z 2025-05-07T20:26:14.4343868Z 2025-05-07T20:26:14.4343872Z 2025-05-07T20:26:14.4343875Z 2025-05-07T20:26:14.4343879Z 2025-05-07T20:26:14.4343883Z 2025-05-07T20:26:14.4343886Z 2025-05-07T20:26:14.4343890Z 2025-05-07T20:26:14.4343894Z 2025-05-07T20:26:14.4344708Z 2025-05-07T20:26:14.4359407Z python-3.11.8 | 29.3 MB | ########9 | 89%  2025-05-07T20:26:14.4359761Z 2025-05-07T20:26:14.4359768Z 2025-05-07T20:26:14.4359773Z 2025-05-07T20:26:14.4359778Z 2025-05-07T20:26:14.4359783Z 2025-05-07T20:26:14.4359788Z 2025-05-07T20:26:14.4359793Z 2025-05-07T20:26:14.4359799Z 2025-05-07T20:26:14.4359804Z 2025-05-07T20:26:14.4359809Z 
2025-05-07T20:26:14.4359813Z 2025-05-07T20:26:14.4359818Z 2025-05-07T20:26:14.4378170Z libnvjitlink-12.8.61 | 28.7 MB | #####9 | 60%  2025-05-07T20:26:14.4378477Z 2025-05-07T20:26:14.4378481Z 2025-05-07T20:26:14.4378485Z 2025-05-07T20:26:14.4378488Z 2025-05-07T20:26:14.4378771Z 2025-05-07T20:26:14.4378774Z 2025-05-07T20:26:14.4378778Z 2025-05-07T20:26:14.4378782Z 2025-05-07T20:26:14.4378785Z 2025-05-07T20:26:14.4378810Z 2025-05-07T20:26:14.4378816Z 2025-05-07T20:26:14.4378820Z 2025-05-07T20:26:14.4378823Z 2025-05-07T20:26:14.4981947Z cuda-nvcc-tools-12.8 | 24.5 MB | #####6 | 57%  2025-05-07T20:26:14.4982326Z 2025-05-07T20:26:14.4982330Z 2025-05-07T20:26:14.4982334Z 2025-05-07T20:26:14.4982338Z 2025-05-07T20:26:14.4982342Z 2025-05-07T20:26:14.4982346Z 2025-05-07T20:26:14.4982350Z 2025-05-07T20:26:14.4982354Z 2025-05-07T20:26:14.4982358Z 2025-05-07T20:26:14.4982363Z 2025-05-07T20:26:14.4982367Z 2025-05-07T20:26:14.4982371Z 2025-05-07T20:26:14.4982374Z 2025-05-07T20:26:14.4984364Z 2025-05-07T20:26:14.5387034Z cuda-nvvm-tools-12.8 | 23.5 MB | ####2 | 43%  2025-05-07T20:26:14.5387525Z 2025-05-07T20:26:14.5387529Z 2025-05-07T20:26:14.5387533Z 2025-05-07T20:26:14.5387560Z 2025-05-07T20:26:14.5387564Z 2025-05-07T20:26:14.5387567Z 2025-05-07T20:26:14.5387571Z 2025-05-07T20:26:14.5387574Z 2025-05-07T20:26:14.5387578Z 2025-05-07T20:26:14.5387582Z 2025-05-07T20:26:14.5387585Z 2025-05-07T20:26:14.5387596Z 2025-05-07T20:26:14.5387600Z 2025-05-07T20:26:14.5416513Z cuda-nvcc-tools-12.8 | 24.5 MB | ######8 | 69%  2025-05-07T20:26:14.5416913Z 2025-05-07T20:26:14.5416917Z 2025-05-07T20:26:14.5416921Z 2025-05-07T20:26:14.5416924Z 2025-05-07T20:26:14.5416934Z 2025-05-07T20:26:14.5416938Z 2025-05-07T20:26:14.5416946Z 2025-05-07T20:26:14.5416949Z 2025-05-07T20:26:14.5416953Z 2025-05-07T20:26:14.5416957Z 2025-05-07T20:26:14.5418221Z 2025-05-07T20:26:14.5474951Z python-3.11.8 | 29.3 MB | #########7 | 98%  2025-05-07T20:26:14.5475330Z 2025-05-07T20:26:14.5475336Z 2025-05-07T20:26:14.5475341Z 2025-05-07T20:26:14.5475346Z 2025-05-07T20:26:14.5475351Z 2025-05-07T20:26:14.5475368Z 2025-05-07T20:26:14.5475373Z 2025-05-07T20:26:14.5475378Z 2025-05-07T20:26:14.5475383Z 2025-05-07T20:26:14.5475388Z 2025-05-07T20:26:14.5475393Z 2025-05-07T20:26:14.5476831Z 2025-05-07T20:26:14.5986833Z libnvjitlink-12.8.61 | 28.7 MB | ######8 | 69%  2025-05-07T20:26:14.5987232Z 2025-05-07T20:26:14.5987239Z 2025-05-07T20:26:14.5987275Z 2025-05-07T20:26:14.5987279Z 2025-05-07T20:26:14.5987283Z 2025-05-07T20:26:14.5987286Z 2025-05-07T20:26:14.5987293Z 2025-05-07T20:26:14.5987296Z 2025-05-07T20:26:14.5987300Z 2025-05-07T20:26:14.5987303Z 2025-05-07T20:26:14.5987315Z 2025-05-07T20:26:14.5987318Z 2025-05-07T20:26:14.5987322Z 2025-05-07T20:26:14.5990085Z 2025-05-07T20:26:14.6423862Z cuda-nvvm-tools-12.8 | 23.5 MB | #####5 | 55%  2025-05-07T20:26:14.6424346Z 2025-05-07T20:26:14.6424376Z 2025-05-07T20:26:14.6424381Z 2025-05-07T20:26:14.6424386Z 2025-05-07T20:26:14.6424392Z 2025-05-07T20:26:14.6424424Z 2025-05-07T20:26:14.6424429Z 2025-05-07T20:26:14.6424435Z 2025-05-07T20:26:14.6424440Z 2025-05-07T20:26:14.6424445Z 2025-05-07T20:26:14.6424450Z 2025-05-07T20:26:14.6424456Z 2025-05-07T20:26:14.6424461Z 2025-05-07T20:26:14.6475274Z cuda-nvcc-tools-12.8 | 24.5 MB | #######9 | 80%  2025-05-07T20:26:14.6475703Z 2025-05-07T20:26:14.6475941Z 2025-05-07T20:26:14.6475946Z 2025-05-07T20:26:14.6475950Z 2025-05-07T20:26:14.6475953Z 2025-05-07T20:26:14.6475957Z 2025-05-07T20:26:14.6475961Z 2025-05-07T20:26:14.6475966Z 2025-05-07T20:26:14.6475971Z 
2025-05-07T20:26:14.6475975Z 2025-05-07T20:26:14.6475980Z 2025-05-07T20:26:14.6475984Z 2025-05-07T20:26:14.6988463Z libnvjitlink-12.8.61 | 28.7 MB | #######8 | 78%  2025-05-07T20:26:14.6988794Z 2025-05-07T20:26:14.6988798Z 2025-05-07T20:26:14.6988802Z 2025-05-07T20:26:14.6988806Z 2025-05-07T20:26:14.6988809Z 2025-05-07T20:26:14.6988813Z 2025-05-07T20:26:14.6989176Z 2025-05-07T20:26:14.6989182Z 2025-05-07T20:26:14.6989187Z 2025-05-07T20:26:14.6989192Z 2025-05-07T20:26:14.6989197Z 2025-05-07T20:26:14.6989202Z 2025-05-07T20:26:14.6989207Z 2025-05-07T20:26:14.6989213Z 2025-05-07T20:26:14.7432835Z cuda-nvvm-tools-12.8 | 23.5 MB | ######7 | 67%  2025-05-07T20:26:14.7433277Z 2025-05-07T20:26:14.7433304Z 2025-05-07T20:26:14.7433308Z 2025-05-07T20:26:14.7433312Z 2025-05-07T20:26:14.7433315Z 2025-05-07T20:26:14.7433319Z 2025-05-07T20:26:14.7433323Z 2025-05-07T20:26:14.7433326Z 2025-05-07T20:26:14.7433330Z 2025-05-07T20:26:14.7433334Z 2025-05-07T20:26:14.7433337Z 2025-05-07T20:26:14.7433342Z 2025-05-07T20:26:14.7433345Z 2025-05-07T20:26:14.7509587Z cuda-nvcc-tools-12.8 | 24.5 MB | ######### | 91%  2025-05-07T20:26:14.7509906Z 2025-05-07T20:26:14.7509910Z 2025-05-07T20:26:14.7509913Z 2025-05-07T20:26:14.7509921Z 2025-05-07T20:26:14.7509927Z 2025-05-07T20:26:14.7509944Z 2025-05-07T20:26:14.7509948Z 2025-05-07T20:26:14.7509951Z 2025-05-07T20:26:14.7509955Z 2025-05-07T20:26:14.7509959Z 2025-05-07T20:26:14.7509962Z 2025-05-07T20:26:14.7511834Z 2025-05-07T20:26:14.8066557Z libnvjitlink-12.8.61 | 28.7 MB | ########7 | 87%  2025-05-07T20:26:14.8066887Z 2025-05-07T20:26:14.8066891Z 2025-05-07T20:26:14.8066914Z 2025-05-07T20:26:14.8066918Z 2025-05-07T20:26:14.8066923Z 2025-05-07T20:26:14.8066927Z 2025-05-07T20:26:14.8066932Z 2025-05-07T20:26:14.8066936Z 2025-05-07T20:26:14.8066941Z 2025-05-07T20:26:14.8066946Z 2025-05-07T20:26:14.8066950Z 2025-05-07T20:26:14.8066955Z 2025-05-07T20:26:14.8066959Z 2025-05-07T20:26:14.8066970Z 2025-05-07T20:26:14.8515250Z cuda-nvvm-tools-12.8 | 23.5 MB | #######8 | 79%  2025-05-07T20:26:14.8515581Z 2025-05-07T20:26:14.8515585Z 2025-05-07T20:26:14.8515589Z 2025-05-07T20:26:14.8515592Z 2025-05-07T20:26:14.8515596Z 2025-05-07T20:26:14.8515632Z 2025-05-07T20:26:14.8515635Z 2025-05-07T20:26:14.8515639Z 2025-05-07T20:26:14.8515643Z 2025-05-07T20:26:14.8515648Z 2025-05-07T20:26:14.8515652Z 2025-05-07T20:26:14.8515655Z 2025-05-07T20:26:14.9068594Z libnvjitlink-12.8.61 | 28.7 MB | #########8 | 99%  2025-05-07T20:26:14.9068921Z 2025-05-07T20:26:14.9068925Z 2025-05-07T20:26:14.9068944Z 2025-05-07T20:26:14.9068948Z 2025-05-07T20:26:14.9068951Z 2025-05-07T20:26:14.9068955Z 2025-05-07T20:26:14.9068958Z 2025-05-07T20:26:14.9068962Z 2025-05-07T20:26:14.9068966Z 2025-05-07T20:26:14.9068969Z 2025-05-07T20:26:14.9068973Z 2025-05-07T20:26:14.9068976Z 2025-05-07T20:26:14.9068980Z 2025-05-07T20:26:14.9068983Z 2025-05-07T20:26:15.6520038Z cuda-nvvm-tools-12.8 | 23.5 MB | #########3 | 93%  2025-05-07T20:26:15.6520388Z 2025-05-07T20:26:15.6520392Z 2025-05-07T20:26:15.6520396Z 2025-05-07T20:26:15.6520399Z 2025-05-07T20:26:15.6520403Z 2025-05-07T20:26:15.6520438Z 2025-05-07T20:26:15.6520446Z 2025-05-07T20:26:15.6520449Z 2025-05-07T20:26:15.6520453Z 2025-05-07T20:26:15.6520457Z 2025-05-07T20:26:15.6522531Z 2025-05-07T20:26:15.6994217Z python-3.11.8 | 29.3 MB | ########## | 100%  2025-05-07T20:26:15.6994517Z 2025-05-07T20:26:15.6994521Z 2025-05-07T20:26:15.6994525Z 2025-05-07T20:26:15.6994770Z 2025-05-07T20:26:15.6994775Z 2025-05-07T20:26:15.6994779Z 2025-05-07T20:26:15.6994783Z 2025-05-07T20:26:15.6994786Z 
2025-05-07T20:26:15.6994802Z 2025-05-07T20:26:15.6994806Z 2025-05-07T20:26:15.6994809Z 2025-05-07T20:26:15.6994813Z 2025-05-07T20:26:15.6994816Z 2025-05-07T20:26:15.6994820Z 2025-05-07T20:26:15.7001269Z 2025-05-07T20:26:15.7036306Z cuda-nvvm-impl-12.8. | 20.8 MB | | 0%  2025-05-07T20:26:15.7036761Z 2025-05-07T20:26:15.7036767Z 2025-05-07T20:26:15.7036772Z 2025-05-07T20:26:15.7036777Z 2025-05-07T20:26:15.7036783Z 2025-05-07T20:26:15.7037044Z 2025-05-07T20:26:15.7037048Z 2025-05-07T20:26:15.7037051Z 2025-05-07T20:26:15.7037055Z 2025-05-07T20:26:15.7037059Z 2025-05-07T20:26:15.7037062Z 2025-05-07T20:26:15.7037066Z 2025-05-07T20:26:15.7037724Z 2025-05-07T20:26:15.7489739Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%  2025-05-07T20:26:15.7490116Z 2025-05-07T20:26:15.7490143Z 2025-05-07T20:26:15.7490147Z 2025-05-07T20:26:15.7490150Z 2025-05-07T20:26:15.7490154Z 2025-05-07T20:26:15.7490158Z 2025-05-07T20:26:15.7490161Z 2025-05-07T20:26:15.7490165Z 2025-05-07T20:26:15.7490169Z 2025-05-07T20:26:15.7490172Z 2025-05-07T20:26:15.7490185Z 2025-05-07T20:26:15.7490189Z 2025-05-07T20:26:15.7490193Z 2025-05-07T20:26:15.7490197Z 2025-05-07T20:26:15.7490200Z 2025-05-07T20:26:15.7492987Z 2025-05-07T20:26:15.7640577Z cuda-nvcc-dev_linux- | 12.7 MB | | 0%  2025-05-07T20:26:15.7641009Z 2025-05-07T20:26:15.7641015Z 2025-05-07T20:26:15.7641040Z 2025-05-07T20:26:15.7641045Z 2025-05-07T20:26:15.7641051Z 2025-05-07T20:26:15.7641056Z 2025-05-07T20:26:15.7641059Z 2025-05-07T20:26:15.7641063Z 2025-05-07T20:26:15.7641067Z 2025-05-07T20:26:15.7641070Z 2025-05-07T20:26:15.7641074Z 2025-05-07T20:26:15.7641078Z 2025-05-07T20:26:15.7641081Z 2025-05-07T20:26:15.7643343Z 2025-05-07T20:26:15.7994173Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%  2025-05-07T20:26:15.7994546Z 2025-05-07T20:26:15.7994552Z 2025-05-07T20:26:15.7994557Z 2025-05-07T20:26:15.7994562Z 2025-05-07T20:26:15.7994567Z 2025-05-07T20:26:15.7994572Z 2025-05-07T20:26:15.7994577Z 2025-05-07T20:26:15.7994583Z 2025-05-07T20:26:15.7994588Z 2025-05-07T20:26:15.7994593Z 2025-05-07T20:26:15.7994598Z 2025-05-07T20:26:15.7994604Z 2025-05-07T20:26:15.7994609Z 2025-05-07T20:26:15.7994614Z 2025-05-07T20:26:15.7996274Z 2025-05-07T20:26:15.8106636Z cuda-nvvm-impl-12.8. 
| 20.8 MB | #5 | 15%  2025-05-07T20:26:15.8107098Z 2025-05-07T20:26:15.8107104Z 2025-05-07T20:26:15.8107109Z 2025-05-07T20:26:15.8107115Z 2025-05-07T20:26:15.8107120Z 2025-05-07T20:26:15.8107126Z 2025-05-07T20:26:15.8107131Z 2025-05-07T20:26:15.8107136Z 2025-05-07T20:26:15.8107142Z 2025-05-07T20:26:15.8107147Z 2025-05-07T20:26:15.8107153Z 2025-05-07T20:26:15.8107159Z 2025-05-07T20:26:15.8107185Z 2025-05-07T20:26:15.8107192Z 2025-05-07T20:26:15.8107198Z 2025-05-07T20:26:15.8107203Z 2025-05-07T20:26:15.8109974Z 2025-05-07T20:26:15.8496842Z cuda-sanitizer-api-1 | 8.8 MB | | 0%  2025-05-07T20:26:15.8497353Z 2025-05-07T20:26:15.8497359Z 2025-05-07T20:26:15.8497364Z 2025-05-07T20:26:15.8497369Z 2025-05-07T20:26:15.8497374Z 2025-05-07T20:26:15.8497379Z 2025-05-07T20:26:15.8497384Z 2025-05-07T20:26:15.8497389Z 2025-05-07T20:26:15.8497395Z 2025-05-07T20:26:15.8497400Z 2025-05-07T20:26:15.8497405Z 2025-05-07T20:26:15.8497432Z 2025-05-07T20:26:15.8497437Z 2025-05-07T20:26:15.8497442Z 2025-05-07T20:26:15.8497447Z 2025-05-07T20:26:15.8499685Z 2025-05-07T20:26:15.8503422Z cuda-nvcc-dev_linux- | 12.7 MB | ##4 | 24%  2025-05-07T20:26:15.8503806Z 2025-05-07T20:26:15.8503814Z 2025-05-07T20:26:15.8503822Z 2025-05-07T20:26:15.8503831Z 2025-05-07T20:26:15.8504123Z 2025-05-07T20:26:15.8504130Z 2025-05-07T20:26:15.8504136Z 2025-05-07T20:26:15.8504141Z 2025-05-07T20:26:15.8504147Z 2025-05-07T20:26:15.8504162Z 2025-05-07T20:26:15.8504167Z 2025-05-07T20:26:15.8506378Z 2025-05-07T20:26:15.9068889Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%  2025-05-07T20:26:15.9069305Z 2025-05-07T20:26:15.9069311Z 2025-05-07T20:26:15.9069315Z 2025-05-07T20:26:15.9069319Z 2025-05-07T20:26:15.9069325Z 2025-05-07T20:26:15.9069329Z 2025-05-07T20:26:15.9069333Z 2025-05-07T20:26:15.9069337Z 2025-05-07T20:26:15.9069342Z 2025-05-07T20:26:15.9069587Z 2025-05-07T20:26:15.9069592Z 2025-05-07T20:26:15.9069596Z 2025-05-07T20:26:15.9069600Z 2025-05-07T20:26:15.9069603Z 2025-05-07T20:26:15.9069652Z 2025-05-07T20:26:15.9107731Z cuda-nvvm-impl-12.8. 
| 20.8 MB | ### | 30%  2025-05-07T20:26:15.9108062Z 2025-05-07T20:26:15.9108066Z 2025-05-07T20:26:15.9108070Z 2025-05-07T20:26:15.9108085Z 2025-05-07T20:26:15.9108089Z 2025-05-07T20:26:15.9108093Z 2025-05-07T20:26:15.9108097Z 2025-05-07T20:26:15.9108100Z 2025-05-07T20:26:15.9108104Z 2025-05-07T20:26:15.9108108Z 2025-05-07T20:26:15.9108111Z 2025-05-07T20:26:15.9108121Z 2025-05-07T20:26:15.9108125Z 2025-05-07T20:26:15.9108129Z 2025-05-07T20:26:15.9108132Z 2025-05-07T20:26:15.9108136Z 2025-05-07T20:26:15.9108139Z 2025-05-07T20:26:15.9127997Z cuda-sanitizer-api-1 | 8.8 MB | ### | 31%  2025-05-07T20:26:15.9128639Z 2025-05-07T20:26:15.9128643Z 2025-05-07T20:26:15.9128655Z 2025-05-07T20:26:15.9128659Z 2025-05-07T20:26:15.9128662Z 2025-05-07T20:26:15.9128666Z 2025-05-07T20:26:15.9128669Z 2025-05-07T20:26:15.9128673Z 2025-05-07T20:26:15.9128676Z 2025-05-07T20:26:15.9128680Z 2025-05-07T20:26:15.9128683Z 2025-05-07T20:26:15.9128687Z 2025-05-07T20:26:15.9128690Z 2025-05-07T20:26:15.9128694Z 2025-05-07T20:26:15.9128697Z 2025-05-07T20:26:15.9128706Z 2025-05-07T20:26:15.9128709Z 2025-05-07T20:26:15.9132256Z 2025-05-07T20:26:15.9561185Z cuda-nvdisasm-12.8.5 | 4.9 MB | | 0%  2025-05-07T20:26:15.9561533Z 2025-05-07T20:26:15.9561537Z 2025-05-07T20:26:15.9561540Z 2025-05-07T20:26:15.9561544Z 2025-05-07T20:26:15.9561548Z 2025-05-07T20:26:15.9561559Z 2025-05-07T20:26:15.9561562Z 2025-05-07T20:26:15.9561566Z 2025-05-07T20:26:15.9561570Z 2025-05-07T20:26:15.9561573Z 2025-05-07T20:26:15.9561577Z 2025-05-07T20:26:15.9561580Z 2025-05-07T20:26:15.9561585Z 2025-05-07T20:26:15.9561589Z 2025-05-07T20:26:15.9561607Z 2025-05-07T20:26:15.9561611Z 2025-05-07T20:26:16.0124161Z cuda-nvcc-dev_linux- | 12.7 MB | ####8 | 49%  2025-05-07T20:26:16.0124521Z 2025-05-07T20:26:16.0124525Z 2025-05-07T20:26:16.0124528Z 2025-05-07T20:26:16.0124532Z 2025-05-07T20:26:16.0124536Z 2025-05-07T20:26:16.0124539Z 2025-05-07T20:26:16.0124543Z 2025-05-07T20:26:16.0124567Z 2025-05-07T20:26:16.0124571Z 2025-05-07T20:26:16.0124574Z 2025-05-07T20:26:16.0124578Z 2025-05-07T20:26:16.0124582Z 2025-05-07T20:26:16.0124585Z 2025-05-07T20:26:16.0124589Z 2025-05-07T20:26:16.0124592Z 2025-05-07T20:26:16.0124596Z 2025-05-07T20:26:16.0125987Z 2025-05-07T20:26:16.0132819Z cuda-sanitizer-api-1 | 8.8 MB | ###### | 61%  2025-05-07T20:26:16.0133155Z 2025-05-07T20:26:16.0133159Z 2025-05-07T20:26:16.0133170Z 2025-05-07T20:26:16.0133174Z 2025-05-07T20:26:16.0133178Z 2025-05-07T20:26:16.0133181Z 2025-05-07T20:26:16.0133196Z 2025-05-07T20:26:16.0133200Z 2025-05-07T20:26:16.0133203Z 2025-05-07T20:26:16.0133207Z 2025-05-07T20:26:16.0133211Z 2025-05-07T20:26:16.0133214Z 2025-05-07T20:26:16.0133218Z 2025-05-07T20:26:16.0133221Z 2025-05-07T20:26:16.0133225Z 2025-05-07T20:26:16.0133229Z 2025-05-07T20:26:16.0133232Z 2025-05-07T20:26:16.0133236Z 2025-05-07T20:26:16.0273014Z cuda-nvdisasm-12.8.5 | 4.9 MB | ###5 | 36%  2025-05-07T20:26:16.0273484Z 2025-05-07T20:26:16.0273489Z 2025-05-07T20:26:16.0273493Z 2025-05-07T20:26:16.0273497Z 2025-05-07T20:26:16.0273500Z 2025-05-07T20:26:16.0273504Z 2025-05-07T20:26:16.0273508Z 2025-05-07T20:26:16.0273511Z 2025-05-07T20:26:16.0273515Z 2025-05-07T20:26:16.0273519Z 2025-05-07T20:26:16.0273522Z 2025-05-07T20:26:16.0273534Z 2025-05-07T20:26:16.0273538Z 2025-05-07T20:26:16.0273542Z 2025-05-07T20:26:16.0276604Z 2025-05-07T20:26:16.0770054Z cuda-nvvm-impl-12.8. 
| 20.8 MB | ####4 | 45%  2025-05-07T20:26:16.0770640Z 2025-05-07T20:26:16.0770644Z 2025-05-07T20:26:16.0770648Z 2025-05-07T20:26:16.0770652Z 2025-05-07T20:26:16.0770655Z 2025-05-07T20:26:16.0770659Z 2025-05-07T20:26:16.0770663Z 2025-05-07T20:26:16.0770666Z 2025-05-07T20:26:16.0770670Z 2025-05-07T20:26:16.0770674Z 2025-05-07T20:26:16.0770677Z 2025-05-07T20:26:16.0770681Z 2025-05-07T20:26:16.0770694Z 2025-05-07T20:26:16.0770698Z 2025-05-07T20:26:16.0770702Z 2025-05-07T20:26:16.0770705Z 2025-05-07T20:26:16.1125365Z cuda-nvcc-dev_linux- | 12.7 MB | #######1 | 72%  2025-05-07T20:26:16.1125702Z 2025-05-07T20:26:16.1125706Z 2025-05-07T20:26:16.1125710Z 2025-05-07T20:26:16.1125713Z 2025-05-07T20:26:16.1125717Z 2025-05-07T20:26:16.1125720Z 2025-05-07T20:26:16.1125724Z 2025-05-07T20:26:16.1125728Z 2025-05-07T20:26:16.1125732Z 2025-05-07T20:26:16.1125736Z 2025-05-07T20:26:16.1125739Z 2025-05-07T20:26:16.1125753Z 2025-05-07T20:26:16.1125770Z 2025-05-07T20:26:16.1125773Z 2025-05-07T20:26:16.1125777Z 2025-05-07T20:26:16.1125780Z 2025-05-07T20:26:16.1127421Z 2025-05-07T20:26:16.1143613Z cuda-sanitizer-api-1 | 8.8 MB | #########3 | 93%  2025-05-07T20:26:16.1144011Z 2025-05-07T20:26:16.1144015Z 2025-05-07T20:26:16.1144018Z 2025-05-07T20:26:16.1144022Z 2025-05-07T20:26:16.1144038Z 2025-05-07T20:26:16.1144042Z 2025-05-07T20:26:16.1144045Z 2025-05-07T20:26:16.1144049Z 2025-05-07T20:26:16.1144052Z 2025-05-07T20:26:16.1144056Z 2025-05-07T20:26:16.1144060Z 2025-05-07T20:26:16.1144063Z 2025-05-07T20:26:16.1144067Z 2025-05-07T20:26:16.1144070Z 2025-05-07T20:26:16.1144074Z 2025-05-07T20:26:16.1144077Z 2025-05-07T20:26:16.1144081Z 2025-05-07T20:26:16.1145333Z 2025-05-07T20:26:16.1272536Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########9 | 90%  2025-05-07T20:26:16.1272887Z 2025-05-07T20:26:16.1272891Z 2025-05-07T20:26:16.1272904Z 2025-05-07T20:26:16.1272908Z 2025-05-07T20:26:16.1272911Z 2025-05-07T20:26:16.1272925Z 2025-05-07T20:26:16.1272928Z 2025-05-07T20:26:16.1272932Z 2025-05-07T20:26:16.1272936Z 2025-05-07T20:26:16.1272939Z 2025-05-07T20:26:16.1272943Z 2025-05-07T20:26:16.1272947Z 2025-05-07T20:26:16.1272950Z 2025-05-07T20:26:16.1272954Z 2025-05-07T20:26:16.1272957Z 2025-05-07T20:26:16.1784339Z cuda-nvvm-impl-12.8. | 20.8 MB | #####8 | 59%  2025-05-07T20:26:16.1784692Z 2025-05-07T20:26:16.1784697Z 2025-05-07T20:26:16.1784702Z 2025-05-07T20:26:16.1784707Z 2025-05-07T20:26:16.1784713Z 2025-05-07T20:26:16.1784718Z 2025-05-07T20:26:16.1784723Z 2025-05-07T20:26:16.1784728Z 2025-05-07T20:26:16.1784733Z 2025-05-07T20:26:16.1784739Z 2025-05-07T20:26:16.1784744Z 2025-05-07T20:26:16.1784749Z 2025-05-07T20:26:16.1784754Z 2025-05-07T20:26:16.1784759Z 2025-05-07T20:26:16.1784764Z 2025-05-07T20:26:16.1784958Z 2025-05-07T20:26:16.2279549Z cuda-nvcc-dev_linux- | 12.7 MB | #########3 | 93%  2025-05-07T20:26:16.2279907Z 2025-05-07T20:26:16.2279911Z 2025-05-07T20:26:16.2279915Z 2025-05-07T20:26:16.2279919Z 2025-05-07T20:26:16.2279930Z 2025-05-07T20:26:16.2279934Z 2025-05-07T20:26:16.2279938Z 2025-05-07T20:26:16.2279941Z 2025-05-07T20:26:16.2279945Z 2025-05-07T20:26:16.2279948Z 2025-05-07T20:26:16.2280171Z 2025-05-07T20:26:16.2280176Z 2025-05-07T20:26:16.2280179Z 2025-05-07T20:26:16.2280183Z 2025-05-07T20:26:16.2280187Z 2025-05-07T20:26:16.3114909Z cuda-nvvm-impl-12.8. 
| 20.8 MB | #######3 | 73%  2025-05-07T20:26:16.3115259Z 2025-05-07T20:26:16.3115264Z 2025-05-07T20:26:16.3115267Z 2025-05-07T20:26:16.3115271Z 2025-05-07T20:26:16.3115275Z 2025-05-07T20:26:16.3115278Z 2025-05-07T20:26:16.3115282Z 2025-05-07T20:26:16.3115286Z 2025-05-07T20:26:16.3115289Z 2025-05-07T20:26:16.3115293Z 2025-05-07T20:26:16.3115297Z 2025-05-07T20:26:16.3115300Z 2025-05-07T20:26:16.3115527Z 2025-05-07T20:26:16.3115530Z 2025-05-07T20:26:16.3115534Z 2025-05-07T20:26:16.3115538Z 2025-05-07T20:26:16.3115541Z 2025-05-07T20:26:16.3117186Z 2025-05-07T20:26:16.3286719Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%  2025-05-07T20:26:16.3287124Z 2025-05-07T20:26:16.3287137Z 2025-05-07T20:26:16.3287160Z 2025-05-07T20:26:16.3287165Z 2025-05-07T20:26:16.3287169Z 2025-05-07T20:26:16.3287172Z 2025-05-07T20:26:16.3287176Z 2025-05-07T20:26:16.3287180Z 2025-05-07T20:26:16.3287183Z 2025-05-07T20:26:16.3287187Z 2025-05-07T20:26:16.3287191Z 2025-05-07T20:26:16.3287194Z 2025-05-07T20:26:16.3287198Z 2025-05-07T20:26:16.3287201Z 2025-05-07T20:26:16.3287205Z 2025-05-07T20:26:16.3721399Z cuda-nvvm-impl-12.8. | 20.8 MB | ########9 | 89%  2025-05-07T20:26:16.3721737Z 2025-05-07T20:26:16.3721741Z 2025-05-07T20:26:16.3721744Z 2025-05-07T20:26:16.3721748Z 2025-05-07T20:26:16.3721763Z 2025-05-07T20:26:16.3721766Z 2025-05-07T20:26:16.3721770Z 2025-05-07T20:26:16.3721773Z 2025-05-07T20:26:16.3721777Z 2025-05-07T20:26:16.3721780Z 2025-05-07T20:26:16.3721784Z 2025-05-07T20:26:16.3721787Z 2025-05-07T20:26:16.3721799Z 2025-05-07T20:26:16.3721803Z 2025-05-07T20:26:16.3721807Z 2025-05-07T20:26:16.3721810Z 2025-05-07T20:26:16.3721814Z 2025-05-07T20:26:16.3721829Z 2025-05-07T20:26:16.3723259Z 2025-05-07T20:26:16.4596022Z ... (more hidden) ... 2025-05-07T20:26:16.4596313Z 2025-05-07T20:26:16.4596317Z 2025-05-07T20:26:16.4596321Z 2025-05-07T20:26:16.4596324Z 2025-05-07T20:26:16.4596328Z 2025-05-07T20:26:16.4596332Z 2025-05-07T20:26:16.4596336Z 2025-05-07T20:26:16.4596349Z 2025-05-07T20:26:16.4596353Z 2025-05-07T20:26:16.4596356Z 2025-05-07T20:26:16.4596360Z 2025-05-07T20:26:16.4596363Z 2025-05-07T20:26:16.4596367Z 2025-05-07T20:26:16.4596370Z 2025-05-07T20:26:16.4596374Z 2025-05-07T20:26:16.4596377Z 2025-05-07T20:26:16.4600930Z 2025-05-07T20:26:16.4728764Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%  2025-05-07T20:26:16.4729128Z 2025-05-07T20:26:16.4729132Z 2025-05-07T20:26:16.4729136Z 2025-05-07T20:26:16.4729139Z 2025-05-07T20:26:16.4729143Z 2025-05-07T20:26:16.4729147Z 2025-05-07T20:26:16.4729150Z 2025-05-07T20:26:16.4729168Z 2025-05-07T20:26:16.4729171Z 2025-05-07T20:26:16.4729175Z 2025-05-07T20:26:16.4729178Z 2025-05-07T20:26:16.4729182Z 2025-05-07T20:26:16.4729185Z 2025-05-07T20:26:16.4729189Z 2025-05-07T20:26:16.4729192Z 2025-05-07T20:26:16.4729196Z 2025-05-07T20:26:16.4729208Z 2025-05-07T20:26:16.4729212Z 2025-05-07T20:26:16.4729216Z 2025-05-07T20:26:16.6613131Z ... (more hidden) ... 
2025-05-07T20:26:16.6613442Z 2025-05-07T20:26:16.6613455Z 2025-05-07T20:26:16.6613460Z 2025-05-07T20:26:16.6613463Z 2025-05-07T20:26:16.6613467Z 2025-05-07T20:26:16.6613470Z 2025-05-07T20:26:16.6613498Z 2025-05-07T20:26:16.6613502Z 2025-05-07T20:26:16.6613506Z 2025-05-07T20:26:16.6613509Z 2025-05-07T20:26:16.6613513Z 2025-05-07T20:26:16.6613517Z 2025-05-07T20:26:16.6613521Z 2025-05-07T20:26:16.6613524Z 2025-05-07T20:26:16.6613528Z 2025-05-07T20:26:16.6614581Z 2025-05-07T20:26:16.6716310Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%  2025-05-07T20:26:16.6716648Z 2025-05-07T20:26:16.6716652Z 2025-05-07T20:26:16.6716656Z 2025-05-07T20:26:16.6716659Z 2025-05-07T20:26:16.6716663Z 2025-05-07T20:26:16.6716666Z 2025-05-07T20:26:16.6716670Z 2025-05-07T20:26:16.6716673Z 2025-05-07T20:26:16.6716677Z 2025-05-07T20:26:16.6716680Z 2025-05-07T20:26:16.6716684Z 2025-05-07T20:26:16.6716695Z 2025-05-07T20:26:16.6716699Z 2025-05-07T20:26:16.6716702Z 2025-05-07T20:26:16.6716706Z 2025-05-07T20:26:16.6716709Z 2025-05-07T20:26:16.6716713Z 2025-05-07T20:26:16.6716717Z 2025-05-07T20:26:16.6716720Z 2025-05-07T20:26:17.0603747Z ... (more hidden) ... 2025-05-07T20:26:17.0604069Z 2025-05-07T20:26:17.0604073Z 2025-05-07T20:26:17.0604077Z 2025-05-07T20:26:17.0604090Z 2025-05-07T20:26:17.0604094Z 2025-05-07T20:26:17.0604097Z 2025-05-07T20:26:17.0604101Z 2025-05-07T20:26:17.0604104Z 2025-05-07T20:26:17.0604109Z 2025-05-07T20:26:17.0604113Z 2025-05-07T20:26:17.0604145Z 2025-05-07T20:26:17.0604149Z 2025-05-07T20:26:17.0604152Z 2025-05-07T20:26:17.0604156Z 2025-05-07T20:26:17.0606587Z 2025-05-07T20:26:17.6596611Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%  2025-05-07T20:26:17.6597125Z 2025-05-07T20:26:17.6597131Z 2025-05-07T20:26:17.6597137Z 2025-05-07T20:26:17.6597142Z 2025-05-07T20:26:17.6597148Z 2025-05-07T20:26:17.6597153Z 2025-05-07T20:26:17.6597159Z 2025-05-07T20:26:17.6597164Z 2025-05-07T20:26:17.6599164Z 2025-05-07T20:26:18.5274511Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%  2025-05-07T20:26:18.5274860Z 2025-05-07T20:26:18.5274882Z 2025-05-07T20:26:18.5274886Z 2025-05-07T20:26:18.5274890Z 2025-05-07T20:26:18.5274893Z 2025-05-07T20:26:18.5274897Z 2025-05-07T20:26:18.5274900Z 2025-05-07T20:26:18.5274913Z 2025-05-07T20:26:18.5274917Z 2025-05-07T20:26:18.5274920Z 2025-05-07T20:26:18.7549001Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%  2025-05-07T20:26:18.7549407Z 2025-05-07T20:26:18.7549423Z 2025-05-07T20:26:18.7549427Z 2025-05-07T20:26:18.7549431Z 2025-05-07T20:26:18.7549435Z 2025-05-07T20:26:18.7549438Z 2025-05-07T20:26:18.7553368Z 2025-05-07T20:26:19.1835732Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%  2025-05-07T20:26:19.4097971Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100% 2025-05-07T20:26:19.4098292Z 2025-05-07T20:26:19.4098364Z 2025-05-07T20:26:19.4098369Z 2025-05-07T20:26:19.4098387Z 2025-05-07T20:26:19.4098428Z 2025-05-07T20:26:19.9438537Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%  2025-05-07T20:26:19.9438972Z 2025-05-07T20:26:19.9438978Z 2025-05-07T20:26:19.9438983Z 2025-05-07T20:26:19.9438989Z 2025-05-07T20:26:19.9439002Z 2025-05-07T20:26:19.9439006Z 2025-05-07T20:26:19.9439010Z 2025-05-07T20:26:19.9439014Z 2025-05-07T20:26:20.4957323Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%  2025-05-07T20:26:20.4957715Z 2025-05-07T20:26:20.4957722Z 2025-05-07T20:26:20.4957742Z 2025-05-07T20:26:20.4957748Z 2025-05-07T20:26:20.4957753Z 2025-05-07T20:26:20.4957758Z 2025-05-07T20:26:20.4957764Z 2025-05-07T20:26:20.4957769Z 2025-05-07T20:26:20.4957774Z 
2025-05-07T20:26:20.4957779Z 2025-05-07T20:26:20.4957784Z 2025-05-07T20:26:20.4957790Z 2025-05-07T20:26:20.4957795Z 2025-05-07T20:26:20.8592872Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%  2025-05-07T20:26:20.8593201Z 2025-05-07T20:26:20.8593205Z 2025-05-07T20:26:20.8593209Z 2025-05-07T20:26:20.8593242Z 2025-05-07T20:26:20.8593245Z 2025-05-07T20:26:20.8593249Z 2025-05-07T20:26:20.8593252Z 2025-05-07T20:26:20.8593256Z 2025-05-07T20:26:20.8593268Z 2025-05-07T20:26:20.8593272Z 2025-05-07T20:26:20.8593279Z 2025-05-07T20:26:20.9518204Z python-3.11.8 | 29.3 MB | ########## | 100%  2025-05-07T20:26:20.9518518Z 2025-05-07T20:26:20.9518839Z 2025-05-07T20:26:20.9518845Z 2025-05-07T20:26:20.9518849Z 2025-05-07T20:26:20.9518854Z 2025-05-07T20:26:20.9518858Z 2025-05-07T20:26:20.9518863Z 2025-05-07T20:26:20.9518866Z 2025-05-07T20:26:20.9518870Z 2025-05-07T20:26:20.9518873Z 2025-05-07T20:26:20.9518877Z 2025-05-07T20:26:20.9518880Z 2025-05-07T20:26:20.9518884Z 2025-05-07T20:26:20.9518888Z 2025-05-07T20:26:20.9833377Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%  2025-05-07T20:26:20.9833732Z 2025-05-07T20:26:20.9833736Z 2025-05-07T20:26:20.9833739Z 2025-05-07T20:26:20.9833743Z 2025-05-07T20:26:20.9834008Z 2025-05-07T20:26:20.9834011Z 2025-05-07T20:26:20.9834015Z 2025-05-07T20:26:20.9834018Z 2025-05-07T20:26:20.9834022Z 2025-05-07T20:26:20.9834026Z 2025-05-07T20:26:20.9834029Z 2025-05-07T20:26:20.9834033Z 2025-05-07T20:26:20.9834047Z 2025-05-07T20:26:20.9834051Z 2025-05-07T20:26:20.9834054Z 2025-05-07T20:26:20.9834058Z 2025-05-07T20:26:20.9834061Z 2025-05-07T20:26:20.9834075Z 2025-05-07T20:26:21.1738677Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%  2025-05-07T20:26:21.1739027Z 2025-05-07T20:26:21.1739030Z 2025-05-07T20:26:21.1739034Z 2025-05-07T20:26:21.1739038Z 2025-05-07T20:26:21.1739041Z 2025-05-07T20:26:21.1739045Z 2025-05-07T20:26:21.1739048Z 2025-05-07T20:26:21.1739052Z 2025-05-07T20:26:21.1739055Z 2025-05-07T20:26:21.1739059Z 2025-05-07T20:26:21.1739062Z 2025-05-07T20:26:21.1739066Z 2025-05-07T20:26:21.1739070Z 2025-05-07T20:26:21.1739073Z 2025-05-07T20:26:21.1739077Z 2025-05-07T20:26:21.1739080Z 2025-05-07T20:26:21.1739115Z 2025-05-07T20:26:21.3433556Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%  2025-05-07T20:26:21.3433932Z 2025-05-07T20:26:21.3433936Z 2025-05-07T20:26:21.3433939Z 2025-05-07T20:26:21.3433943Z 2025-05-07T20:26:21.3433946Z 2025-05-07T20:26:21.3433950Z 2025-05-07T20:26:21.3433954Z 2025-05-07T20:26:21.3433958Z 2025-05-07T20:26:21.3433991Z 2025-05-07T20:26:21.3433995Z 2025-05-07T20:26:21.3433999Z 2025-05-07T20:26:21.3434006Z 2025-05-07T20:26:21.4861599Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%  2025-05-07T20:26:21.4862051Z 2025-05-07T20:26:21.4862056Z 2025-05-07T20:26:21.4862059Z 2025-05-07T20:26:21.4862063Z 2025-05-07T20:26:21.4862066Z 2025-05-07T20:26:21.4862078Z 2025-05-07T20:26:21.4862082Z 2025-05-07T20:26:21.4862085Z 2025-05-07T20:26:21.4862089Z 2025-05-07T20:26:21.4862092Z 2025-05-07T20:26:21.4862096Z 2025-05-07T20:26:21.4862100Z 2025-05-07T20:26:21.4862103Z 2025-05-07T20:26:21.4862136Z 2025-05-07T20:26:21.4862140Z 2025-05-07T20:26:21.4862150Z 2025-05-07T20:26:21.5075971Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%  2025-05-07T20:26:21.5076411Z 2025-05-07T20:26:21.5076415Z 2025-05-07T20:26:21.5076419Z 2025-05-07T20:26:21.5076431Z 2025-05-07T20:26:21.5076435Z 2025-05-07T20:26:21.5076459Z 2025-05-07T20:26:21.5076463Z 2025-05-07T20:26:21.5076466Z 2025-05-07T20:26:21.5076470Z 2025-05-07T20:26:21.5076473Z 2025-05-07T20:26:21.5076484Z 
2025-05-07T20:26:21.5076487Z 2025-05-07T20:26:21.5076491Z 2025-05-07T20:26:21.5076494Z 2025-05-07T20:26:21.5076498Z 2025-05-07T20:26:21.5076501Z 2025-05-07T20:26:21.5076505Z 2025-05-07T20:26:21.5076509Z 2025-05-07T20:26:21.5076513Z 2025-05-07T20:26:22.3449785Z ... (more hidden) ... 2025-05-07T20:26:22.3450107Z 2025-05-07T20:26:22.3450111Z 2025-05-07T20:26:22.3450125Z 2025-05-07T20:26:22.3450130Z 2025-05-07T20:26:22.3450161Z 2025-05-07T20:26:22.3450165Z 2025-05-07T20:26:22.3450169Z 2025-05-07T20:26:22.3450173Z 2025-05-07T20:26:22.3450177Z 2025-05-07T20:26:22.3450181Z 2025-05-07T20:26:22.3450185Z 2025-05-07T20:26:22.3450188Z 2025-05-07T20:26:22.3450192Z 2025-05-07T20:26:22.3450196Z 2025-05-07T20:26:22.3450199Z 2025-05-07T20:26:26.1732816Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%  2025-05-07T20:26:26.1733162Z 2025-05-07T20:26:27.4367372Z nsight-compute-2025. | 320.6 MB | ########## | 100%  2025-05-07T20:26:27.4375277Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100% 2025-05-07T20:26:27.4375544Z 2025-05-07T20:26:27.4375549Z 2025-05-07T20:26:27.4375553Z 2025-05-07T20:26:27.4375557Z 2025-05-07T20:26:27.4375564Z 2025-05-07T20:26:27.4375568Z 2025-05-07T20:26:27.4375572Z 2025-05-07T20:26:27.4375577Z 2025-05-07T20:26:27.4375581Z 2025-05-07T20:26:27.4375585Z 2025-05-07T20:26:27.4375588Z 2025-05-07T20:26:27.4375592Z 2025-05-07T20:26:27.4375816Z 2025-05-07T20:26:27.4375820Z 2025-05-07T20:26:27.4375823Z 2025-05-07T20:26:27.4375827Z 2025-05-07T20:26:27.4375831Z 2025-05-07T20:26:27.4375834Z 2025-05-07T20:26:27.4375838Z 2025-05-07T20:26:27.4375939Z 2025-05-07T20:26:27.4376269Z  2025-05-07T20:26:27.4376594Z 2025-05-07T20:26:27.4376795Z 2025-05-07T20:26:27.4376979Z  2025-05-07T20:26:27.4377176Z 2025-05-07T20:26:27.4377181Z 2025-05-07T20:26:27.4377356Z  2025-05-07T20:26:27.4377559Z 2025-05-07T20:26:27.4377563Z 2025-05-07T20:26:27.4377567Z 2025-05-07T20:26:27.4377750Z  2025-05-07T20:26:27.4377956Z 2025-05-07T20:26:27.4377960Z 2025-05-07T20:26:27.4377964Z 2025-05-07T20:26:27.4377967Z 2025-05-07T20:26:27.4378154Z  2025-05-07T20:26:27.4378371Z 2025-05-07T20:26:27.4378374Z 2025-05-07T20:26:27.4378378Z 2025-05-07T20:26:27.4378381Z 2025-05-07T20:26:27.4378385Z 2025-05-07T20:26:27.4378829Z  2025-05-07T20:26:27.4379076Z 2025-05-07T20:26:27.4379080Z 2025-05-07T20:26:27.4379110Z 2025-05-07T20:26:27.4379114Z 2025-05-07T20:26:27.4379118Z 2025-05-07T20:26:27.4379121Z 2025-05-07T20:26:27.4379311Z  2025-05-07T20:26:27.4379531Z 2025-05-07T20:26:27.4379535Z 2025-05-07T20:26:27.4379539Z 2025-05-07T20:26:27.4379542Z 2025-05-07T20:26:27.4379546Z 2025-05-07T20:26:27.4379549Z 2025-05-07T20:26:27.4379553Z 2025-05-07T20:26:27.4379867Z  2025-05-07T20:26:27.4380173Z 2025-05-07T20:26:27.4380188Z 2025-05-07T20:26:27.4380194Z 2025-05-07T20:26:27.4380201Z 2025-05-07T20:26:27.4380216Z 2025-05-07T20:26:27.4380222Z 2025-05-07T20:26:27.4380227Z 2025-05-07T20:26:27.4380232Z 2025-05-07T20:26:27.4380513Z  2025-05-07T20:26:27.4380746Z 2025-05-07T20:26:27.4380756Z 2025-05-07T20:26:27.4380760Z 2025-05-07T20:26:27.4380763Z 2025-05-07T20:26:27.4380767Z 2025-05-07T20:26:27.4380776Z 2025-05-07T20:26:27.4380780Z 2025-05-07T20:26:27.4380783Z 2025-05-07T20:26:27.4380787Z 2025-05-07T20:26:27.4381015Z  2025-05-07T20:26:27.4381241Z 2025-05-07T20:26:27.4381245Z 2025-05-07T20:26:27.4381248Z 2025-05-07T20:26:27.4381252Z 2025-05-07T20:26:27.4381255Z 2025-05-07T20:26:27.4381259Z 2025-05-07T20:26:27.4381262Z 2025-05-07T20:26:27.4381266Z 2025-05-07T20:26:27.4381270Z 2025-05-07T20:26:27.4381273Z 2025-05-07T20:26:27.4382103Z  2025-05-07T20:26:27.4382404Z 
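[NOTE] The progress lines above are conda downloading the CUDA 12.8 toolchain and Python 3.11 into the build_binary environment; part of the package list is hidden by the log. As a rough sketch, an equivalent one-shot install would look like the following (channel names, the --quiet flag, and the lack of version pins are assumptions, not taken from this log, and the packages whose names are truncated above are left out):

# Sketch only: install the CUDA 12.8 build toolchain into the existing env.
# The list covers just the packages visible in the log; the
# "... (more hidden) ..." marker means the real transaction installed more.
conda install -n build_binary -y --quiet \
    -c conda-forge -c nvidia \
    python=3.11 \
    cuda-nvcc-tools cuda-nvvm-tools cuda-nvdisasm \
    cuda-sanitizer-api libnvjitlink nsight-compute libcublas

Resolving everything in a single conda transaction like this is what produces the "Preparing/Verifying/Executing transaction" lines that follow.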
2025-05-07T20:26:27.4438717Z done
2025-05-07T20:26:27.7647803Z Preparing transaction: done
2025-05-07T20:26:32.2630577Z Verifying transaction: done
2025-05-07T20:26:33.0743025Z Executing transaction: done
2025-05-07T20:26:35.4637459Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ...
2025-05-07T20:26:35.4637893Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:35.4638604Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:35.4651880Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:35.4664135Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:35.4669307Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:35.6233409Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:35.6256008Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:35.6633507Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:37.5562891Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:37.6204062Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:38.0504284Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:38.0852908Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:38.5210960Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:38.5212220Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:40.9855536Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:43.0270466Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:45.0663509Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:45.0665090Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:47.0948825Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:48.9952020Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:49.0577103Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:52.9192323Z /tmp/tmpkx9dulct: line 3: clang: command not found
2025-05-07T20:26:52.9192913Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:52.9821069Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:52.9843879Z total 36
2025-05-07T20:26:52.9844269Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 .
2025-05-07T20:26:52.9844798Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:25 ..
2025-05-07T20:26:52.9845380Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:26:52.9845919Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:26:52.9846478Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:26:52.9847121Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:52.9847739Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:52.9848346Z -rw-r--r--. 2 ec2-user ec2-user 2932 Jan 24 22:22 ~cuda-nvcc_activate.sh
2025-05-07T20:26:52.9848881Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:52.9849511Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:52.9870356Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:54.9368367Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:54.9368907Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:55.3662064Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:57.2572051Z -allow-unsupported-compiler
2025-05-07T20:26:57.3205400Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:26:57.3205913Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null
2025-05-07T20:26:59.2922241Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead")))
2025-05-07T20:26:59.2923002Z #define M_PIl 3.141592653589793238462643383279502884L
2025-05-07T20:26:59.2923410Z #define _IO_CURRENTLY_PUTTING 0x800
2025-05-07T20:26:59.2923847Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig))
2025-05-07T20:26:59.2924251Z #define __DBL_MIN_EXP__ (-1021)
2025-05-07T20:26:59.2924518Z #define _STL_PAIR_H 1
2025-05-07T20:26:59.2924771Z #define __cpp_attributes 200809L
2025-05-07T20:26:59.2925177Z #define __cpp_nontype_template_parameter_auto 201606L
2025-05-07T20:26:59.2925572Z #define __DELETE_THROW throw()
2025-05-07T20:26:59.2926186Z #define _PTRDIFF_T_
2025-05-07T20:26:59.2926575Z #define M_PI_4 0.78539816339744830962
2025-05-07T20:26:59.2926973Z #define __UINT_LEAST16_MAX__ 0xffff
2025-05-07T20:26:59.2927335Z #define _IO_LEFT 02
2025-05-07T20:26:59.2927631Z #define __ATOMIC_ACQUIRE 2
2025-05-07T20:26:59.2927919Z #define _POSIX2_BC_SCALE_MAX 99
2025-05-07T20:26:59.2928513Z #define _GLIBCXX_USE_RANDOM_TR1 1
2025-05-07T20:26:59.2929092Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp)
2025-05-07T20:26:59.2929558Z #define __FLT128_MAX_10_EXP__ 4932
2025-05-07T20:26:59.2929834Z #define RE_DUP_MAX (0x7fff)
2025-05-07T20:26:59.2930290Z #define _IOS_OUTPUT 2
2025-05-07T20:26:59.2930586Z #define __SM_100_RT_HPP__
2025-05-07T20:26:59.2930995Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
2025-05-07T20:26:59.2931505Z #define toascii_l(c,l) __toascii_l ((c), (l))
2025-05-07T20:26:59.2932056Z #define __GCC_IEC_559_COMPLEX 2
2025-05-07T20:26:59.2932549Z #define _GLIBCXX_USE_FCHMOD 1
2025-05-07T20:26:59.2933042Z #define __cpp_aggregate_nsdmi 201304L
2025-05-07T20:26:59.2934295Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; }))
2025-05-07T20:26:59.2935456Z #define __UINT_LEAST8_TYPE__ unsigned char
2025-05-07T20:26:59.2935883Z #define __SIZEOF_FLOAT80__ 16
2025-05-07T20:26:59.2936280Z #define
cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:59.2936587Z #define _T_WCHAR_ 2025-05-07T20:26:59.2936806Z #define stdout stdout 2025-05-07T20:26:59.2937136Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:59.2937511Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:59.2937765Z #define __flexarr [] 2025-05-07T20:26:59.2937998Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:59.2938317Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:59.2938664Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:59.2938908Z #define _MATH_H 1 2025-05-07T20:26:59.2939184Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:59.2939519Z #define __S64_TYPE long int 2025-05-07T20:26:59.2939776Z #define __stub_fchflags 2025-05-07T20:26:59.2940033Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:59.2940323Z #define __SQUAD_TYPE long int 2025-05-07T20:26:59.2940585Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:59.2940882Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:59.2941216Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:59.2941471Z #define NL_NMAX INT_MAX 2025-05-07T20:26:59.2941700Z #define _BITS_TIME_H 1 2025-05-07T20:26:59.2941974Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:59.2942299Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:59.2942593Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:59.2942941Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:59.2943338Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:59.2943691Z #define __CHAR_BIT__ 8 2025-05-07T20:26:59.2943952Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:59.2944265Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:59.2944559Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:59.2944816Z #define FP_NAN 0 2025-05-07T20:26:59.2945077Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:59.2945481Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:59.2945880Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:59.2955681Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:59.2955969Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:59.2956228Z #define __SM_80_RT_H__ 2025-05-07T20:26:59.2956458Z #define _NEW 2025-05-07T20:26:59.2956681Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:59.2956965Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:59.2957587Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:59.2957992Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:59.2958234Z #define __USE_ANSI 1 2025-05-07T20:26:59.2958519Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:59.2958898Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:59.2959252Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:59.2959549Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:59.2959818Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:59.2960098Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:59.2960374Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:59.2960648Z #define PIPE_BUF 4096 2025-05-07T20:26:59.2961075Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:59.2961525Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:59.2961887Z #define ADJ_TICK 0x4000 2025-05-07T20:26:59.2962188Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 
2025-05-07T20:26:59.2962528Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:59.2962798Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:59.2963106Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:59.2963560Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:59.2964075Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:59.2964431Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:59.2964690Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:59.2964964Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:59.2965242Z #define __cpp_static_assert 201411L 2025-05-07T20:26:59.2965523Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:59.2965797Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:59.2966063Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:59.2966338Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:59.2966637Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:59.2966911Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:59.2967205Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.2967563Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:59.2967900Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:59.2968172Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:59.2968481Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:59.2968832Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:59.2969174Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:59.2969466Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:59.2969755Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:59.2970069Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:59.2970395Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:59.2970789Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:59.2971193Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:59.2971484Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:59.2971755Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:59.2972040Z #define __GCC_IEC_559 2 2025-05-07T20:26:59.2972366Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:59.2972696Z #define _IO_flockfile(_fp) 2025-05-07T20:26:59.2972958Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:59.2973220Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:59.2973484Z #define _IOFBF 0 2025-05-07T20:26:59.2973695Z #define __USE_BSD 1 2025-05-07T20:26:59.2973915Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:59.2974185Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:59.2974459Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:59.2974719Z #define _IO_NO_WRITES 8 2025-05-07T20:26:59.2974968Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:59.2975317Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:59.2975664Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:59.2975961Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:59.2976388Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:59.2976680Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:59.2976936Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:59.2977203Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:59.2977509Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:59.2977882Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:59.2978239Z #define __cpp_noexcept_function_type 201510L 
2025-05-07T20:26:59.2978542Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:59.2978849Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:59.2979165Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:59.2979555Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:59.2979853Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:59.2980119Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:59.2980382Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:59.2980966Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:59.2981540Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:59.2981861Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:59.2982181Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:59.2982527Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:59.2982789Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:59.2983051Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:59.2983353Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:59.2983669Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:59.2983970Z #define RAND_MAX 2147483647 2025-05-07T20:26:59.2984235Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:59.2984549Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.2984858Z #define __SM_90_RT_H__ 2025-05-07T20:26:59.2985098Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:59.2985348Z #define __COMPAR_FN_T 2025-05-07T20:26:59.2985592Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:59.2985855Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:59.2986313Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:59.2986813Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:59.2987150Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:59.2987500Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:59.2987785Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:59.2988117Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:59.2988425Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:59.2988921Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:59.2989592Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:59.2989917Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:59.2990180Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:59.2990475Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:59.2990775Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:59.2991040Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:59.2991299Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:59.2991560Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:59.2991809Z #define __u_char_defined 2025-05-07T20:26:59.2992115Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:59.2992472Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:59.2992727Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:59.2992971Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:59.2993247Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:59.2993681Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:59.2994089Z #define FP_INFINITE 1 2025-05-07T20:26:59.2994453Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:59.2994864Z #define 
_IO_pid_t __pid_t 2025-05-07T20:26:59.2995116Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:59.2995528Z #define __LEAF , __leaf__ 2025-05-07T20:26:59.2995771Z #define PATH_MAX 4096 2025-05-07T20:26:59.2996018Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:59.2996342Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:59.2996657Z #define _LIMITS_H___ 2025-05-07T20:26:59.2996879Z #define __size_t 2025-05-07T20:26:59.2997099Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:59.2997626Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:59.2998178Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:59.2998557Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:59.2998880Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:59.2999138Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:59.2999489Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:59.2999872Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:59.3000169Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:59.3000489Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:59.3000762Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:59.3001037Z #define __INT8_C(c) c 2025-05-07T20:26:59.3001296Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:59.3001587Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:59.3001846Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:59.3002101Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:59.3002340Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:59.3002613Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:59.3002928Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.3003249Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:59.3003519Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:59.3003788Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:59.3004049Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:59.3004352Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:59.3004649Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:59.3005009Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:59.3005373Z #define NFDBITS __NFDBITS 2025-05-07T20:26:59.3005629Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:59.3005914Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:59.3006220Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:59.3006533Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:59.3006787Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:59.3007066Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:59.3007366Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:59.3007680Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:59.3008094Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:59.3008444Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:59.3008728Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:59.3009038Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:59.3009357Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:59.3009673Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:59.3010004Z #define __daddr_t_defined 2025-05-07T20:26:59.3010248Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:59.3010522Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:59.3010834Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 
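SHRT_MIN above, and SCHAR_MIN, LONG_MIN, and LONG_LONG_MIN later in the dump, are all spelled (-MAX - 1) rather than written as a literal. The reason is that C and C++ have no negative integer literals: -9223372036854775808 is unary minus applied to a positive literal that no signed type can hold, while -LONG_MAX - 1L stays in range throughout. A short illustration (values assume the LP64 model this log reflects):

    #include <climits>
    #include <cstdio>

    int main() {
        // LONG_MIN expands to (-LONG_MAX - 1L); the plain literal would
        // first have to represent +9223372036854775808, which overflows.
        std::printf("%ld\n", LONG_MIN);           // -9223372036854775808
        std::printf("%ld\n", -LONG_MAX - 1L);     // same value, computed safely
        // The unsigned limits in this dump use the matching trick,
        // e.g. ULONG_MAX as (LONG_MAX * 2UL + 1UL) in unsigned arithmetic.
        std::printf("%lu\n", LONG_MAX * 2UL + 1UL); // 18446744073709551615
    }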
2025-05-07T20:26:59.3011353Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:59.3011894Z #define _ACRTIMP 2025-05-07T20:26:59.3012170Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:59.3012499Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:59.3012853Z #define _IOS_BIN 128 2025-05-07T20:26:59.3013280Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:59.3013776Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3014105Z #define UNDERFLOW 4 2025-05-07T20:26:59.3014375Z #define NAME_MAX 255 2025-05-07T20:26:59.3014703Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:59.3014971Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:59.3015246Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:59.3015527Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:59.3015900Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:59.3016283Z #define __ptr_t void * 2025-05-07T20:26:59.3016520Z #define M_E 2.7182818284590452354 2025-05-07T20:26:59.3016787Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:59.3017054Z #define __USE_ISOCXX11 1 2025-05-07T20:26:59.3017320Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:59.3017711Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:59.3018002Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:59.3018275Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:59.3018550Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:59.3018861Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:59.3019115Z #define __linux 1 2025-05-07T20:26:59.3019342Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:59.3019617Z #define cudaDeviceMask 0xff 2025-05-07T20:26:59.3019883Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:59.3020165Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:59.3020445Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:59.3020725Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:59.3021026Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:59.3021319Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:59.3021609Z #define _BITS_TYPES_H 1 2025-05-07T20:26:59.3021897Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:59.3022312Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:59.3022680Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:59.3023016Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:59.3023360Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:59.3023712Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:59.3024482Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:59.3025282Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:59.3025558Z #define __unix 1 2025-05-07T20:26:59.3025773Z #define MATH_ERRNO 1 2025-05-07T20:26:59.3026010Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:59.3026278Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:59.3026533Z #define __SM_100_RT_H__ 2025-05-07T20:26:59.3026782Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:59.3027058Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:59.3027350Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:59.3027626Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:59.3027918Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:59.3029023Z 
#define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:59.3029548Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:59.3029845Z #define CUDARTAPI_CDECL 2025-05-07T20:26:59.3030095Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:59.3030368Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:59.3030655Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:59.3030911Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:59.3031146Z #define __SIZE_T 2025-05-07T20:26:59.3031395Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:59.3031703Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:26:59.3031996Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:59.3032253Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:59.3032515Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:59.3032778Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:59.3033158Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:59.3033572Z #define __WAIT_STATUS void * 2025-05-07T20:26:59.3033833Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:59.3034099Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:59.3034692Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:59.3034974Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:59.3035247Z #define __WINT_MIN__ 0U 2025-05-07T20:26:59.3035811Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:59.3036436Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:59.3036734Z #define WUNTRACED 2 2025-05-07T20:26:59.3036961Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:59.3037228Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:59.3039197Z #define NZERO 20 2025-05-07T20:26:59.3039426Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:59.3039698Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:59.3039990Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:59.3040278Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:59.3040531Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:59.3040813Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:59.3041085Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:59.3041358Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:59.3041621Z #define EXIT_FAILURE 1 2025-05-07T20:26:59.3041862Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:59.3042120Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:59.3042377Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:59.3042628Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:59.3042904Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:59.3043232Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:59.3044324Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 
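__CUDART_API_VERSION at the start of this block packs the runtime API version as major * 1000 + minor * 10. __CUDA_API_VER_MAJOR__ is 12 in this dump; __CUDA_API_VER_MINOR__ is not visible in this excerpt, so 8 is assumed below from the nearby __CUDACC_VER_MINOR__ 8. A small sketch of the encode/decode arithmetic:

    #include <cstdio>

    int main() {
        const int major = 12, minor = 8;            // 12 from this dump; 8 assumed
        const int api = major * 1000 + minor * 10;  // __CUDART_API_VERSION packing
        std::printf("packed: %d\n", api);           // 12080
        // Decoding reverses the packing.
        std::printf("major=%d minor=%d\n", api / 1000, (api % 1000) / 10);
    }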
2025-05-07T20:26:59.3045016Z 2025-05-07T20:26:59.3045132Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:59.3045427Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:59.3045674Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:59.3045951Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:59.3046242Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:59.3046541Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:59.3046835Z #define SEEK_DATA 3 2025-05-07T20:26:59.3047067Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:59.3047352Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:59.3047765Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:59.3048151Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:59.3048403Z #define __INT64_C(c) c ## L 2025-05-07T20:26:59.3048667Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:59.3048999Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:59.3049320Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:59.3049590Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:59.3049886Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:59.3050182Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:59.3050434Z #define __INT_WCHAR_T_H 2025-05-07T20:26:59.3050669Z #define WSTOPPED 2 2025-05-07T20:26:59.3050905Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:59.3051181Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:59.3051430Z #define FP_NORMAL 4 2025-05-07T20:26:59.3051672Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:59.3051969Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:59.3052262Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:59.3052578Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:59.3052926Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:59.3053253Z #define cudaTextureType1D 0x01 2025-05-07T20:26:59.3053586Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:59.3053917Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:59.3054240Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:59.3054558Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:59.3054979Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:59.3055414Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:59.3055779Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:59.3056044Z #define _POSIX_SOURCE 1 2025-05-07T20:26:59.3056285Z #define cudaTextureType2D 0x02 2025-05-07T20:26:59.3056545Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:59.3056817Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:59.3057122Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:59.3057386Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:59.3057705Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:59.3058038Z #define cudaTextureType3D 0x03 2025-05-07T20:26:59.3058300Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:59.3058647Z #define CLOCK_REALTIME 0 2025-05-07T20:26:59.3058893Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:59.3059158Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:59.3059457Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:59.3059732Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:59.3060000Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:59.3060295Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:59.3060566Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:59.3060862Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:59.3061157Z #define _GLIBCXX_USE_C99_FENV_TR1 1 
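__isascii and __toascii just above are pure bit operations: ASCII is exactly the 7-bit range, so testing or clearing everything above bit 6 suffices. A sketch of the two masks:

    #include <cstdio>

    int main() {
        // __isascii(c): (((c) & ~0x7f) == 0) -- no bits set above the low seven.
        // __toascii(c): ((c) & 0x7f)         -- force the value into 0..127.
        int c = 0xC9;                               // 'É' in Latin-1, not ASCII
        std::printf("%d\n", (c & ~0x7f) == 0);      // 0: not ASCII
        std::printf("0x%02x\n", c & 0x7f);          // 0x49 ('I'): masked to 7 bits
        std::printf("%d\n", ('A' & ~0x7f) == 0);    // 1: 'A' is ASCII
    }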
2025-05-07T20:26:59.3061434Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:59.3061695Z #define __GLIBC__ 2 2025-05-07T20:26:59.3061968Z #define __END_DECLS } 2025-05-07T20:26:59.3062264Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:59.3062713Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:59.3063170Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:59.3063487Z #define WCONTINUED 8 2025-05-07T20:26:59.3063748Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:59.3063998Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:59.3064267Z #define _ALLOCA_H 1 2025-05-07T20:26:59.3064497Z #define __host__ __location__(host) 2025-05-07T20:26:59.3064905Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:59.3065340Z #define __SLONG32_TYPE int 2025-05-07T20:26:59.3065602Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:59.3065882Z #define _SYS_SELECT_H 1 2025-05-07T20:26:59.3066117Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:59.3066360Z #define _IOS_NOCREATE 32 2025-05-07T20:26:59.3066609Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:59.3066878Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:59.3067166Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:59.3067450Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:59.3067724Z #define __global__ __location__(global) 2025-05-07T20:26:59.3068012Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:59.3068267Z #define __cpp_decltype_auto 201304L 2025-05-07T20:26:59.3068532Z #define __DBL_DIG__ 15 2025-05-07T20:26:59.3068758Z #define TIME_UTC 1 2025-05-07T20:26:59.3068977Z #define __FLT32_DIG__ 6 2025-05-07T20:26:59.3069397Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:59.3069796Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:59.3070112Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:59.3070412Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:59.3070706Z #define _G_BUFSIZ 8192 2025-05-07T20:26:59.3071008Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:59.3071367Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:59.3071658Z #define __cudaCDP2GetDevice 2025-05-07T20:26:59.3071933Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:59.3072219Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:59.3072465Z #define __GXX_WEAK__ 1 2025-05-07T20:26:59.3072741Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:59.3073043Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:59.3073301Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:59.3073594Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:59.3073929Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:59.3074200Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:59.3074664Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:59.3074965Z #define _G_config_h 1 2025-05-07T20:26:59.3075233Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:59.3075565Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:59.3075841Z #define _GCC_WCHAR_T 2025-05-07T20:26:59.3076075Z #define TMP_MAX 238328 2025-05-07T20:26:59.3076307Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:59.3076576Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:59.3076833Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:59.3134301Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:59.3134663Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:59.3135378Z #define _IO_SKIPWS 01 
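__CONCAT(x,y) x ## y above is the preprocessor's token-pasting operator; the companion stringizing form appears later in the dump as __STRING(x) #x and inside _PSTL_PRAGMA(x) _Pragma(#x). A minimal sketch of both operators (CONCAT and STRING are illustrative local names):

    #include <cstdio>

    #define CONCAT(x, y) x ## y   // paste two tokens into one identifier
    #define STRING(x)    #x       // turn a token sequence into a string literal

    int main() {
        int CONCAT(counter_, 1) = 42;              // declares `counter_1`
        std::printf("%d\n", counter_1);            // 42
        std::printf("%s\n", STRING(hello world));  // "hello world"
    }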
2025-05-07T20:26:59.3135769Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:59.3136218Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:59.3136478Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:59.3136804Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:59.3137156Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:59.3137510Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:59.3137851Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:59.3138096Z #define le32toh(x) (x) 2025-05-07T20:26:59.3138318Z #define _SIZE_T_DEFINED 2025-05-07T20:26:59.3138556Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:59.3138877Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:59.3139224Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:59.3139621Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:59.3140034Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:59.3140305Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:59.3140571Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:59.3140832Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:59.3141118Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:59.3141685Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:59.3142299Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:59.3142670Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:59.3143094Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:59.3143481Z #define _WCHAR_T_ 2025-05-07T20:26:59.3143751Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:59.3144195Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:59.3144594Z #define RTSIG_MAX 32 2025-05-07T20:26:59.3144813Z #define _STDDEF_H 2025-05-07T20:26:59.3145041Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:59.3145314Z #define _VA_LIST_DEFINED 2025-05-07T20:26:59.3145562Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:59.3145882Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:59.3146264Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:59.3146590Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:59.3146879Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:59.3147334Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:59.3147852Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:59.3148208Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:59.3148520Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:59.3148830Z #define __unix__ 1 2025-05-07T20:26:59.3149135Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:59.3149409Z #define __INT_WIDTH__ 32 2025-05-07T20:26:59.3149650Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:59.3149885Z #define _IONBF 2 2025-05-07T20:26:59.3150318Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:59.3151067Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:59.3151800Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:59.3152110Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:59.3152436Z #define __UINT16_C(c) c 2025-05-07T20:26:59.3152728Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:59.3153054Z #define STA_DEL 0x0020 2025-05-07T20:26:59.3153347Z #define __CUDACC_VER_MINOR__ 8 2025-05-07T20:26:59.3153657Z #define __id_t_defined 2025-05-07T20:26:59.3153957Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:59.3154398Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:59.3154818Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:59.3155162Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:59.3155413Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:59.3155664Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:59.3155925Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:59.3156180Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:59.3156440Z #define SING 2 2025-05-07T20:26:59.3156663Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:59.3156922Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3157219Z #define cudaStreamDefault 0x00 2025-05-07T20:26:59.3157563Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:59.3157932Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:59.3158192Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:59.3158457Z #define __gnu_linux__ 1 2025-05-07T20:26:59.3158692Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:59.3158939Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:59.3159232Z #define MAX_INPUT 255 2025-05-07T20:26:59.3159474Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:59.3159798Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:59.3160167Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:59.3160479Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:59.3160737Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:59.3161135Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:59.3161555Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:59.3161886Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:59.3162234Z #define _Mfloat_ float 2025-05-07T20:26:59.3162497Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:59.3162809Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:59.3163089Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:59.3163407Z #define cudaMemPoolCreateUsageHwDecompress 0x2 2025-05-07T20:26:59.3163934Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:59.3164414Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3164688Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:59.3165009Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:59.3165363Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:59.3165654Z #define __USE_ISOC11 1 2025-05-07T20:26:59.3165881Z #define _BSD_SIZE_T_ 2025-05-07T20:26:59.3166114Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:59.3166353Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:59.3166610Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:59.3166904Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:59.3167212Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:59.3167514Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:59.3167841Z #define __THROW throw () 2025-05-07T20:26:59.3168086Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:59.3168375Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3168724Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:59.3169067Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:59.3169339Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:59.3169598Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:59.3169859Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:59.3170110Z #define L_tmpnam 20 2025-05-07T20:26:59.3170335Z #define ___int_wchar_t_h 2025-05-07T20:26:59.3170772Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:59.3171146Z #define isascii(c) __isascii (c) 2025-05-07T20:26:59.3171401Z #define _T_PTRDIFF 2025-05-07T20:26:59.3171705Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:59.3172053Z #define toascii(c) __toascii (c) 2025-05-07T20:26:59.3172333Z #define __GNUC__ 11 2025-05-07T20:26:59.3172605Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:59.3172896Z #define __GXX_RTTI 1 2025-05-07T20:26:59.3173115Z #define __pie__ 2 2025-05-07T20:26:59.3173322Z #define __MMX__ 1 2025-05-07T20:26:59.3173623Z #define __cudaCDP2Malloc 2025-05-07T20:26:59.3173874Z #define __timespec_defined 1 2025-05-07T20:26:59.3174122Z #define L_ctermid 9 2025-05-07T20:26:59.3174348Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:59.3174654Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:59.3175046Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:59.3175420Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:59.3175681Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:59.3175968Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:59.3176269Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:59.3176571Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:59.3176833Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:59.3177264Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:59.3177986Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:59.3178584Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:59.3178886Z #define __USE_SVID 1 2025-05-07T20:26:59.3179139Z #define __constant__ __location__(constant) 2025-05-07T20:26:59.3179445Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:59.3179738Z #define __device__ __location__(device) 2025-05-07T20:26:59.3180067Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:59.3180380Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:59.3180640Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:59.3180915Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:59.3181250Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:59.3181613Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:59.3181945Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:59.3182388Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:59.3182855Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:59.3183158Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:59.3183607Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:59.3184067Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:59.3184376Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:59.3184643Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:59.3184899Z #define NGROUPS_MAX 65536 2025-05-07T20:26:59.3185157Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:59.3185418Z #define __USE_ISOC95 1 2025-05-07T20:26:59.3185635Z #define _TIME_H 1 2025-05-07T20:26:59.3185897Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:59.3186212Z #define __USE_ISOC99 1 2025-05-07T20:26:59.3186523Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:59.3186884Z #define HOST_NAME_MAX 64 2025-05-07T20:26:59.3187133Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:59.3187389Z #define _IOS_ATEND 4 2025-05-07T20:26:59.3187616Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:59.3187940Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:59.3188336Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:59.3188668Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:59.3188947Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:59.3189360Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:59.3189775Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:59.3190033Z #define _STDIO_H 1 2025-05-07T20:26:59.3190435Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:59.3190888Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:59.3191250Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:59.3191624Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:59.3191914Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:59.3192172Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:59.3192440Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:59.3192818Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:59.3193110Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3193420Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:59.3193694Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:59.3193962Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:59.3194272Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:59.3194544Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:59.3194821Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:59.3195174Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:59.3195555Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:59.3195793Z #define __USE_XOPEN 1 2025-05-07T20:26:59.3196035Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:59.3196467Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:59.3196899Z #define __USE_XOPEN2K 1 2025-05-07T20:26:59.3197142Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:59.3197406Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:59.3197697Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:59.3197969Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:59.3198487Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:59.3199000Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:59.3199286Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:59.3199643Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:59.3200025Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:59.3200396Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:59.3200786Z #define __END_NAMESPACE_C99 2025-05-07T20:26:59.3201059Z #define __glibcxx_integral_traps true 2025-05-07T20:26:59.3201336Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:59.3201595Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:59.3201880Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:59.3202197Z #define _IOS_TRUNC 16 2025-05-07T20:26:59.3202487Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:59.3202796Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:59.3203142Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:59.3203509Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:59.3203953Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:59.3204391Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:59.3204659Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:59.3204918Z #define _IO_UNITBUF 020000 2025-05-07T20:26:59.3205167Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:59.3205417Z #define __FD_SETSIZE 1024 2025-05-07T20:26:59.3205664Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:59.3205931Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:59.3206263Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:59.3206613Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:59.3206875Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:59.3207174Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:59.3207498Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:59.3207765Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:59.3208056Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:59.3208386Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:59.3208670Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:59.3209132Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:59.3209414Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:59.3209682Z #define __USE_POSIX199506 1 2025-05-07T20:26:59.3209933Z #define _FEATURES_H 1 2025-05-07T20:26:59.3210164Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:59.3210547Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:59.3211016Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:59.3211337Z #define 
__stub_getmsg 2025-05-07T20:26:59.3211597Z #define _IO_FIXED 010000 2025-05-07T20:26:59.3211935Z #define __cpp_lib_addressof_constexpr 201603 2025-05-07T20:26:59.3212418Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:59.3212752Z #define __stub_setlogin 2025-05-07T20:26:59.3213049Z #define __stub_fattach 2025-05-07T20:26:59.3213341Z #define __cplusplus 201703L 2025-05-07T20:26:59.3213663Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:59.3213960Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:59.3214220Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:59.3214488Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:59.3214970Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:59.3215489Z #define _IO_INTERNAL 010 2025-05-07T20:26:59.3215726Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:59.3216057Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:59.3216405Z #define __dev_t_defined 2025-05-07T20:26:59.3216634Z #define __DEPRECATED 1 2025-05-07T20:26:59.3216862Z #define __S32_TYPE int 2025-05-07T20:26:59.3217113Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:59.3217405Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:59.3217663Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:59.3217914Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:59.3218507Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:59.3219127Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:59.3219439Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:59.3219776Z #define OVERFLOW 3 2025-05-07T20:26:59.3220015Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:59.3220320Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:59.3220606Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:59.3220934Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:59.3221260Z #define __SSE2_MATH__ 1 2025-05-07T20:26:59.3221502Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:59.3221800Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:59.3222096Z #define _IO_STDIO_H 2025-05-07T20:26:59.3222343Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:59.3222627Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:59.3222932Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:59.3223223Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3223530Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:59.3223789Z #define __amd64 1 2025-05-07T20:26:59.3224010Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:59.3224273Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:59.3224538Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:59.3224823Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:59.3225121Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:59.3225374Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:59.3225669Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:59.3225929Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:59.3226169Z #define __bounded 2025-05-07T20:26:59.3226393Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:59.3226665Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:59.3226949Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:59.3227220Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:59.3227484Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:59.3227756Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:59.3228069Z #define __W_STOPCODE(sig) ((sig) 
<< 8 | 0x7f) 2025-05-07T20:26:59.3229299Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:59.3229810Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:59.3230074Z #define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:59.3230410Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:59.3230751Z #define STA_PLL 0x0001 2025-05-07T20:26:59.3230990Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:59.3231254Z #define __GNUG__ 11 2025-05-07T20:26:59.3231486Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:59.3231753Z #define _T_WCHAR 2025-05-07T20:26:59.3232044Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:59.3232521Z #define __specialization_static 2025-05-07T20:26:59.3232893Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:59.3233267Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:59.3233584Z #define cudaArraySparse 0x40 2025-05-07T20:26:59.3233905Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:59.3234238Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:59.3234538Z #define _WCHAR_T 2025-05-07T20:26:59.3234755Z #define __cudaCDP2Free 2025-05-07T20:26:59.3235374Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:59.3236055Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:59.3236466Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:59.3236898Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:59.3237163Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:59.3237426Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:59.3237753Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:59.3238090Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:59.3238332Z #define __NO_CTYPE 1 2025-05-07T20:26:59.3238557Z #define __stub_bdflush 2025-05-07T20:26:59.3238920Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:59.3239334Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:59.3239632Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:59.3239886Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:59.3240159Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:59.3240460Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:59.3240749Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:59.3241070Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:59.3241416Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:59.3241698Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:59.3241976Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:59.3242319Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:59.3242676Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:59.3242983Z #define _IO_STDIO 040000 2025-05-07T20:26:59.3243307Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:59.3243693Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:59.3244001Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:59.3244289Z #define _PTRDIFF_T 2025-05-07T20:26:59.3244504Z #define _MOVE_H 1 2025-05-07T20:26:59.3244730Z #define __cpp_hex_float 201603L 2025-05-07T20:26:59.3244982Z #define ADJ_TAI 0x0080 2025-05-07T20:26:59.3245208Z #define __ptrvalue 2025-05-07T20:26:59.3245429Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:59.3245674Z 
#define __GXX_ABI_VERSION 1016 2025-05-07T20:26:59.3245958Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:59.3246254Z #define MATH_ERREXCEPT 2 2025-05-07T20:26:59.3246504Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:59.3246782Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:59.3247176Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:59.3247546Z #define __USE_GNU 1 2025-05-07T20:26:59.3247778Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:59.3248154Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:59.3248414Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:59.3248791Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:59.3249169Z #define WEXITED 4 2025-05-07T20:26:59.3249388Z #define _IO_NO_READS 4 2025-05-07T20:26:59.3249678Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:59.3250019Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:59.3250294Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:59.3250579Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:59.3250889Z #define __uid_t_defined 2025-05-07T20:26:59.3251216Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:59.3251495Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:59.3251767Z #define WNOHANG 1 2025-05-07T20:26:59.3252010Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:59.3252305Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:59.3252573Z #define cudaEventDefault 0x00 2025-05-07T20:26:59.3252873Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:59.3253191Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:59.3253418Z #define __x86_64 1 2025-05-07T20:26:59.3253646Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:59.3254034Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:59.3254497Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:59.3254986Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:59.3255414Z #define __PTRDIFF_T 2025-05-07T20:26:59.3255725Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:59.3256100Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:59.3265556Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:59.3265900Z #define _Mlong_double_ long double 2025-05-07T20:26:59.3266192Z #define __cpp_lambdas 200907L 2025-05-07T20:26:59.3266449Z #define _IO_DEC 020 2025-05-07T20:26:59.3266684Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:59.3266963Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:59.3267253Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:59.3267531Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:59.3267799Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:59.3268098Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:59.3268420Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:59.3268697Z #define _ANSI_STDDEF_H 2025-05-07T20:26:59.3268979Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:59.3269351Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:59.3269718Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:59.3270113Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:59.3270390Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:59.3270685Z #define __cpp_template_auto 201606L 2025-05-07T20:26:59.3271047Z #define __DBL_MIN__ 
double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:59.3271420Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:59.3271740Z #define __key_t_defined 2025-05-07T20:26:59.3272041Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:59.3272494Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:59.3273075Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:59.3273519Z #define __GNUC_VA_LIST 2025-05-07T20:26:59.3273909Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:59.3274294Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:59.3274553Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:59.3274839Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:59.3275134Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:59.3275383Z #define __WCOREFLAG 0x80 2025-05-07T20:26:59.3275632Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:59.3275937Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:59.3276213Z #define __LP64__ 1 2025-05-07T20:26:59.3276647Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:59.3276970Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:59.3277254Z #define _IO_off64_t __off64_t 2025-05-07T20:26:59.3277508Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3277770Z #define __time_t_defined 1 2025-05-07T20:26:59.3278024Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:59.3278363Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:59.3278728Z #define __USE_UNIX98 1 2025-05-07T20:26:59.3278972Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:59.3279235Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:59.3279598Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:59.3279894Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:59.3280202Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:59.3280455Z #define SEEK_CUR 1 2025-05-07T20:26:59.3280681Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:59.3280950Z #define _ASSERT_H 1 2025-05-07T20:26:59.3281518Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:59.3282144Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:59.3282416Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:59.3282664Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:59.3282928Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:59.3283198Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:59.3283562Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:59.3283967Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:59.3284625Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:59.3285279Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:59.3285572Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:59.3285930Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:59.3286307Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:59.3286570Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:59.3286852Z #define cudaArrayDefault 0x00 2025-05-07T20:26:59.3287130Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:59.3287416Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:59.3287700Z #define TLOSS 5 2025-05-07T20:26:59.3287920Z #define __ssize_t_defined 2025-05-07T20:26:59.3288173Z #define __CUDACC_VER_BUILD__ 61 2025-05-07T20:26:59.3288439Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:59.3288735Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:59.3289015Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:59.3289291Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:59.3289577Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:59.3289885Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:59.3290172Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:59.3290463Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:59.3290755Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:59.3291008Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:59.3291338Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:59.3291705Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:59.3291992Z #define __cdecl 2025-05-07T20:26:59.3292292Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:59.3292699Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:59.3293104Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:59.3293410Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:59.3293747Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:59.3294112Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:59.3294421Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:59.3294727Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:59.3295054Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:59.3295451Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:59.3295983Z #define ADJ_NANO 0x2000 2025-05-07T20:26:59.3296289Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:59.3296643Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:59.3296922Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:59.3297183Z #define __FLT_DIG__ 6 2025-05-07T20:26:59.3297534Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:59.3297926Z #define __NO_INLINE__ 1 2025-05-07T20:26:59.3298228Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:59.3298578Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:59.3298911Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:59.3299176Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:59.3299465Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:59.3299729Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:59.3300030Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:59.3300317Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:59.3300700Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:26:59.3301115Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:59.3301459Z 
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:59.3301864Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:59.3302159Z #define MAX_CANON 255 2025-05-07T20:26:59.3302448Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:59.3302767Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:59.3303093Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:59.3303445Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:59.3303826Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:59.3304128Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:59.3304406Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:59.3304729Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:59.3305037Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:59.3305301Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:59.3305601Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:59.3305891Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:59.3306167Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:59.3306478Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:59.3306773Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:59.3307028Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:59.3307282Z #define _SYS_TYPES_H 1 2025-05-07T20:26:59.3307525Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:59.3307780Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:59.3308032Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:59.3308270Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:59.3308547Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:59.3308841Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:59.3309191Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:59.3309485Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:59.3309754Z #define FP_SUBNORMAL 3 2025-05-07T20:26:59.3310007Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:59.3310287Z #define _INITIALIZER_LIST 2025-05-07T20:26:59.3310537Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:59.3310794Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:59.3311083Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:59.3311334Z #define _IO_file_flags _flags 2025-05-07T20:26:59.3311589Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:59.3311837Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:59.3312107Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:59.3312381Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:59.3312647Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:59.3313015Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:59.3313414Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:59.3313723Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:59.3313988Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:59.3314244Z #define _BSD_SOURCE 1 2025-05-07T20:26:59.3314479Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:59.3315442Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template> struct __has_ ##_NTYPE : false_type { }; template struct __has_ ##_NTYPE<_Tp, __void_t> : true_type { }; 2025-05-07T20:26:59.3316297Z #define __catch(X) catch(X) 2025-05-07T20:26:59.3316558Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:59.3316848Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:59.3317116Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:59.3317367Z #define __STRING(x) #x 2025-05-07T20:26:59.3317609Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:59.3317876Z #define _T_PTRDIFF_ 2025-05-07T20:26:59.3318214Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:59.3318520Z 
#define cudaEventWaitExternal 0x01 2025-05-07T20:26:59.3318791Z #define __unbounded 2025-05-07T20:26:59.3319033Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:59.3319323Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:59.3319603Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:59.3319903Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:59.3320181Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:59.3320480Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:59.3320806Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:59.3321118Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:59.3321398Z #define __managed__ __location__(managed) 2025-05-07T20:26:59.3321694Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:59.3322168Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:59.3322692Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:59.3323013Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:59.3323474Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:59.3323970Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:59.3324282Z #define _SYS_SIZE_T_H 2025-05-07T20:26:59.3324569Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:59.3324910Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:59.3325190Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:59.3325480Z #define _CRTIMP 2025-05-07T20:26:59.3325705Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:59.3326018Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:59.3326350Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:59.3326704Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:59.3327120Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:59.3327430Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:59.3327710Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:59.3328002Z #define __SIZE_T__ 2025-05-07T20:26:59.3328557Z #define __stub_gtty 2025-05-07T20:26:59.3328861Z #define __pid_t_defined 2025-05-07T20:26:59.3329133Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:59.3329439Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:59.3329757Z #define __glibcxx_function_requires(...) 
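The be16toh/htobe16 family scattered through this dump (be64toh, htobe32, be32toh, le32toh, htole32) converts between big-endian and host byte order. On this little-endian machine the le* forms are the identity (le32toh(x) (x)) while the be* forms byte-swap, and __bswap_constant_16 near the end of the dump shows the 16-bit swap itself. A sketch of that swap (bswap16 is an illustrative helper mirroring __bswap_constant_16):

    #include <cstdint>
    #include <cstdio>

    // Same shape as __bswap_constant_16: exchange the two bytes.
    constexpr std::uint16_t bswap16(std::uint16_t x) {
        return static_cast<std::uint16_t>(((x >> 8) & 0xff) | ((x & 0xff) << 8));
    }

    int main() {
        std::printf("0x%04x\n", (unsigned)bswap16(0x1234)); // 0x3412
        // On a little-endian host, be16toh(x) expands to exactly this swap;
        // on a big-endian host it would expand to the identity instead.
    }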
2025-05-07T20:26:59.3330064Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:59.3330305Z #define __need_clockid_t 2025-05-07T20:26:59.3330551Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:59.3330808Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:59.3331121Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:59.3331455Z #define _IO_HEX 0100 2025-05-07T20:26:59.3331774Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:59.3332188Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:59.3332314Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:59.3332442Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:59.3332728Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:59.3332876Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:59.3333007Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:59.3333139Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:59.3333271Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:59.3333657Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:59.3333754Z #define __stub_sstk 2025-05-07T20:26:59.3333846Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:59.3334001Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:59.3334089Z #define __wur 2025-05-07T20:26:59.3334206Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:59.3334299Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:59.3334381Z #define _IO_OCT 040 2025-05-07T20:26:59.3334476Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:59.3334571Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:59.3334662Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:59.3334913Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:59.3335011Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:59.3335114Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:59.3335302Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:59.3335404Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:59.3335502Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:59.3335610Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:59.3335707Z #define __off64_t_defined 2025-05-07T20:26:59.3335806Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:59.3335899Z #define __FLT128_DIG__ 33 2025-05-07T20:26:59.3336005Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:59.3336104Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:59.3336197Z #define __INT32_C(c) c 2025-05-07T20:26:59.3336292Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:59.3336391Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:59.3336491Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:59.3336589Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:59.3336676Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:59.3336780Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:59.3336910Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:59.3337004Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:59.3337104Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:59.3337205Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:59.3337304Z #define __have_pthread_attr_t 1 2025-05-07T20:26:59.3337403Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:59.3337623Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:59.3337736Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:59.3337838Z #define __cudaCDP2EventRecord 2025-05-07T20:26:59.3337932Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:59.3338024Z #define 
htole32(x) (x) 2025-05-07T20:26:59.3338271Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:59.3338397Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:59.3338506Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:59.3338662Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:59.3338804Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:59.3338928Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:59.3339072Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:59.3339170Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:59.3339270Z #define cudaArrayLayered 0x01 2025-05-07T20:26:59.3339438Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:59.3339558Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:59.3339654Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:59.3339755Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:59.3339842Z #define unix 1 2025-05-07T20:26:59.3339935Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:59.3340036Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:59.3340135Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:59.3340251Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:59.3340344Z #define __USE_POSIX 1 2025-05-07T20:26:59.3340440Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:59.3340572Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:59.3340671Z #define __THROWNL throw () 2025-05-07T20:26:59.3340893Z #define __cpp_rtti 199711L 2025-05-07T20:26:59.3340997Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:59.3341093Z #define __PMT(args) args 2025-05-07T20:26:59.3341210Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.3341357Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:59.3341478Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:59.3341569Z #define _SIZE_T_DECLARED 2025-05-07T20:26:59.3341671Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:59.3341765Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:59.3342156Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:59.3342346Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:59.3342441Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:59.3342535Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:59.3342682Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:59.3342767Z #define _WCHAR_T_H 2025-05-07T20:26:59.3342865Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:59.3342964Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:59.3343051Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:59.3343156Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:59.3343249Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:59.3343338Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:59.3343451Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:59.3343534Z #define __ELF__ 1 2025-05-07T20:26:59.3343634Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:59.3343741Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:59.3343827Z #define STA_INS 0x0010 2025-05-07T20:26:59.3343931Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:59.3344106Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:59.3344203Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:59.3344302Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:59.3344418Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:26:59.3344532Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3344637Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:59.3344740Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:59.3344840Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:59.3344998Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:59.3345156Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:59.3345254Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:59.3345579Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:59.3345706Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:59.3345807Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:59.3345904Z #define __FLT_RADIX__ 2 2025-05-07T20:26:59.3346007Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:59.3346179Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:59.3346274Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:59.3346373Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:59.3346481Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:59.3346577Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:59.3346678Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:59.3346786Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:59.3346870Z #define WORD_BIT 32 2025-05-07T20:26:59.3346956Z #define _IO_USER_BUF 1 2025-05-07T20:26:59.3347054Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:59.3347158Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3347268Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:59.3347374Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:59.3347476Z #define __long_double_t long double 2025-05-07T20:26:59.3347577Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:59.3347669Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:59.3348069Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:59.3348158Z #define __k8 1 2025-05-07T20:26:59.3348443Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:59.3348616Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:59.3348740Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:59.3348841Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:59.3348940Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:59.3349118Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:59.3349213Z #define __blksize_t_defined 2025-05-07T20:26:59.3349312Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:59.3349410Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:59.3349608Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:59.3349710Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:59.3349815Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:59.3349912Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:59.3350017Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:59.3350276Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:59.3350615Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:59.3350723Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:59.3350822Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:59.3350911Z #define SEEK_SET 0 2025-05-07T20:26:59.3351010Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:59.3351106Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:26:59.3351303Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:59.3351408Z #define __cudaCDP2GetLastError 2025-05-07T20:26:59.3351514Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:59.3351635Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:59.3352031Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:59.3352154Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:59.3352283Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:59.3352406Z #define __stub_sigreturn 2025-05-07T20:26:59.3352708Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:59.3352831Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:59.3352946Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:59.3353078Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:59.3353184Z #define CLOCK_TAI 11 2025-05-07T20:26:59.3353320Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:59.3353585Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:59.3353696Z #define __restrict_arr 2025-05-07T20:26:59.3353841Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:59.3354020Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:59.3354609Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:59.3354805Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:59.3354893Z #define __USE_MISC 1 2025-05-07T20:26:59.3354999Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:59.3355109Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:59.3355200Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:59.3355287Z #define __LDBL_DIG__ 18 2025-05-07T20:26:59.3355394Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:59.3355500Z #define __malloc_and_calloc_defined 2025-05-07T20:26:59.3355600Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:59.3355707Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:59.3355795Z #define __x86_64__ 1 2025-05-07T20:26:59.3355883Z #define _SIZE_T_ 2025-05-07T20:26:59.3356850Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:59.3356956Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:59.3357064Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:59.3357178Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:59.3357303Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:59.3357397Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:59.3357506Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:59.3357631Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:59.3357768Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:59.3357942Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:59.3358411Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:26:59.3358534Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:26:59.3358691Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:26:59.3358792Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:26:59.3358891Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:26:59.3358984Z #define STA_FLL 0x0008 2025-05-07T20:26:59.3359126Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:26:59.3359222Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:26:59.3359349Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3359461Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:26:59.3359547Z #define __stub_revoke 2025-05-07T20:26:59.3359648Z #define __timer_t_defined 1 2025-05-07T20:26:59.3359785Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:26:59.3359876Z #define INT_MAX __INT_MAX__ 2025-05-07T20:26:59.3359992Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:26:59.3360097Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:26:59.3360200Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:26:59.3360306Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:26:59.3360415Z #define cudaArrayTextureGather 0x08 2025-05-07T20:26:59.3360524Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:26:59.3360668Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:26:59.3360763Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:26:59.3360864Z #define _IO_off_t __off_t 2025-05-07T20:26:59.3360953Z #define __FLT64_DIG__ 15 2025-05-07T20:26:59.3361172Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:26:59.3361280Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:26:59.3361409Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.3361547Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:26:59.3361644Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:26:59.3361745Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:26:59.3361836Z #define NULL __null 2025-05-07T20:26:59.3361967Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:26:59.3362077Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:26:59.3362188Z #define __U64_TYPE unsigned long int 2025-05-07T20:26:59.3362282Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3362375Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:26:59.3362467Z #define FP_ZERO 2 2025-05-07T20:26:59.3362564Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:26:59.3362722Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:26:59.3362856Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3362958Z #define __WCHAR_T__ 2025-05-07T20:26:59.3363066Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:26:59.3363262Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:26:59.3363418Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:26:59.3363518Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:26:59.3363639Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:26:59.3363757Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:59.3364006Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:26:59.3364136Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:26:59.3364227Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:26:59.3364325Z #define _SIGSET_H_types 1 2025-05-07T20:26:59.3364441Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:26:59.3364552Z #define __cpp_unicode_literals 200710L 2025-05-07T20:26:59.3364701Z 
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:26:59.3364808Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:26:59.3364933Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:26:59.3365068Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:26:59.3365282Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:26:59.3365415Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:26:59.3365527Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:26:59.3365699Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:26:59.3365808Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:26:59.3365911Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:26:59.3366021Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:26:59.3366110Z #define STA_MODE 0x4000 2025-05-07T20:26:59.3366219Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:26:59.3366331Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:26:59.3366448Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:26:59.3366549Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:26:59.3366651Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:26:59.3366758Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:26:59.3366857Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:26:59.3366978Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:26:59.3367068Z #define __SIZE_WIDTH__ 64 2025-05-07T20:26:59.3367184Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:59.3367275Z #define __SEG_FS 1 2025-05-07T20:26:59.3367365Z #define _IO_size_t size_t 2025-05-07T20:26:59.3367476Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:26:59.3367575Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:26:59.3367661Z #define __stub_lchmod 2025-05-07T20:26:59.3367759Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:26:59.3367870Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3367967Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:26:59.3368056Z #define __SEG_GS 1 2025-05-07T20:26:59.3368235Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:26:59.3368325Z #define _IOS_APPEND 8 2025-05-07T20:26:59.3368428Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:26:59.3368520Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:26:59.3368622Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:26:59.3368727Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:26:59.3368826Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:26:59.3368917Z #define htole16(x) (x) 2025-05-07T20:26:59.3369027Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:59.3369121Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:26:59.3369226Z #define __INT16_TYPE__ short int 2025-05-07T20:26:59.3369328Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:26:59.3369434Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:26:59.3369553Z #define __cpp_structured_bindings 201606L 2025-05-07T20:26:59.3369677Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:26:59.3369769Z #define __SIZEOF_INT__ 4 2025-05-07T20:26:59.3369864Z #define __WCLONE 0x80000000 2025-05-07T20:26:59.3369956Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:26:59.3370039Z #define SEEK_HOLE 4 2025-05-07T20:26:59.3370133Z #define TIMER_ABSTIME 1 2025-05-07T20:26:59.3370234Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:26:59.3370331Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:26:59.3370504Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:59.3370616Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3370718Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:26:59.3370917Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:26:59.3371017Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3371143Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:26:59.3371233Z #define _LINUX_LIMITS_H 2025-05-07T20:26:59.3371314Z #define linux 1 2025-05-07T20:26:59.3371415Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:26:59.3371524Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:26:59.3371628Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:26:59.3371722Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:26:59.3371826Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:26:59.3371975Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:26:59.3372151Z #define __cpp_lib_hypot 201603 2025-05-07T20:26:59.3372247Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3372350Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:26:59.3372440Z #define MOD_NANO ADJ_NANO 2025-05-07T20:26:59.3372525Z #define htole64(x) (x) 2025-05-07T20:26:59.3372632Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:26:59.3372762Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:26:59.3372857Z #define _IO_UPPERCASE 01000 2025-05-07T20:26:59.3373351Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:26:59.3373439Z #define __USE_POSIX2 1 2025-05-07T20:26:59.3373547Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:26:59.3373635Z #define __WALL 0x40000000 2025-05-07T20:26:59.3373732Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:26:59.3373825Z #define _XLOCALE_H 1 2025-05-07T20:26:59.3373920Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:26:59.3374023Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:26:59.3374125Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:26:59.3374228Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:26:59.3374316Z #define __EXCEPTIONS 1 2025-05-07T20:26:59.3374421Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:26:59.3374612Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:26:59.3374711Z #define __WORDSIZE 64 2025-05-07T20:26:59.3374817Z #define CLOCK_MONOTONIC 1 2025-05-07T20:26:59.3375020Z #define _STL_RELOPS_H 1 2025-05-07T20:26:59.3375188Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:26:59.3375320Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:26:59.3384507Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:26:59.3384632Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:26:59.3384735Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:26:59.3385041Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:26:59.3385275Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:59.3385430Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:26:59.3385533Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:26:59.3385638Z #define __cpp_range_based_for 201603L 2025-05-07T20:26:59.3385759Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:26:59.3385868Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:26:59.3385979Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:26:59.3386168Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:26:59.3386269Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:26:59.3386364Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:26:59.3386477Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:26:59.3386650Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:59.3386771Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:26:59.3386859Z #define _STRING_H 1 2025-05-07T20:26:59.3386961Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:26:59.3387062Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:26:59.3387163Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:26:59.3387300Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:26:59.3387403Z #define __code_model_small__ 1 2025-05-07T20:26:59.3387494Z #define _PSTL_CONFIG_H 2025-05-07T20:26:59.3387597Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:26:59.3387879Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:26:59.3387977Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:26:59.3388089Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:26:59.3388427Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:59.3388525Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:59.3388622Z #define le64toh(x) (x) 2025-05-07T20:26:59.3388715Z #define FILENAME_MAX 4096 2025-05-07T20:26:59.3388868Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:59.3388994Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:59.3389319Z #define L_cuserid 9 2025-05-07T20:26:59.3389412Z #define __ino_t_defined 2025-05-07T20:26:59.3389503Z #define __k8__ 1 2025-05-07T20:26:59.3389603Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:59.3389713Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:59.3389817Z #define __int8_t_defined 2025-05-07T20:26:59.3389919Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:59.3390028Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:59.3390144Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:59.3390242Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:59.3390367Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:59.3390516Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:59.3390609Z #define __HAVE_COLUMN 2025-05-07T20:26:59.3390696Z #define __stub_fdetach 2025-05-07T20:26:59.3391100Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:59.3391194Z #define __pic__ 2 2025-05-07T20:26:59.3391313Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.3391411Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:59.3391534Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:59.3391658Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:59.3391773Z #define __stub_chflags 2025-05-07T20:26:59.3391891Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:59.3391996Z #define __need_IOV_MAX 2025-05-07T20:26:59.3392131Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:59.3392270Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:59.3392393Z #define __cpp_decltype 200707L 2025-05-07T20:26:59.3392520Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:59.3392634Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:59.3392766Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:59.3392882Z #define TTY_NAME_MAX 32 2025-05-07T20:26:59.3393086Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:59.3393242Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3393457Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:59.3393596Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:59.3393713Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:59.3393835Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:59.3393944Z #define __import__ 2025-05-07T20:26:59.3394062Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:59.3394200Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:59.3394287Z #define __export__ 2025-05-07T20:26:59.3394411Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:59.3394512Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:59.3394674Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:59.3394777Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:59.3394867Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:59.3394963Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:59.3395064Z #define _WCHAR_T_DECLARED 2025-05-07T20:26:59.3395182Z #define 
__UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:26:59.3395301Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:26:59.3395414Z #define __cpp_inline_variables 201606L 2025-05-07T20:26:59.3395504Z #define WNOWAIT 0x01000000 2025-05-07T20:26:59.3395597Z #define PLOSS 6 2025-05-07T20:26:59.3395779Z #define M_LN10 2.30258509299404568402 2025-05-07T20:26:59.3396041Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:26:59.3396135Z #define EXIT_SUCCESS 0 2025-05-07T20:26:59.3396232Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:26:59.3396326Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:26:59.3396434Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:26:59.3396524Z #define __thread__ __thread 2025-05-07T20:26:59.3396620Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:26:59.3396720Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:26:59.3396822Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:26:59.3397128Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:59.3397242Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:26:59.3397336Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:26:59.3397428Z #define __linux__ 1 2025-05-07T20:26:59.3397523Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:26:59.3397653Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:26:59.3397751Z #define __S16_TYPE short int 2025-05-07T20:26:59.3398092Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:26:59.3398198Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:26:59.3398391Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:26:59.3398490Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:26:59.3398598Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:26:59.3398679Z #define _T_SIZE_ 2025-05-07T20:26:59.3398782Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:59.3398906Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:26:59.3398999Z #define _PSTL_VERSION 12000 2025-05-07T20:26:59.3399120Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:26:59.3399224Z #define __WNOTHREAD 0x20000000 2025-05-07T20:26:59.3399321Z #define _G_va_list __gnuc_va_list 2025-05-07T20:26:59.3399453Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:26:59.3399549Z #define _IOS_INPUT 1 2025-05-07T20:26:59.3399641Z #define __USE_LARGEFILE64 1 2025-05-07T20:26:59.3399744Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:26:59.3399846Z #define __INT64_TYPE__ long int 2025-05-07T20:26:59.3399944Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:26:59.3400048Z #define __shared__ __location__(shared) 2025-05-07T20:26:59.3400140Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:26:59.3400294Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:26:59.3400391Z #define __gid_t_defined 2025-05-07T20:26:59.3400511Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:26:59.3400610Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:26:59.3400813Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:26:59.3400909Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:26:59.3401000Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:26:59.3401097Z #define ___int_size_t_h 2025-05-07T20:26:59.3401202Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:59.3401328Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:26:59.3401484Z 
#define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:59.3401588Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:59.3401689Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:59.3401787Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:59.3401881Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:59.3402010Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3402124Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:59.3402250Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:59.3402347Z #define __clock_t_defined 1 2025-05-07T20:26:59.3402447Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:59.3402566Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:59.3402656Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:59.3402749Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:59.3402948Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:59.3403058Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:59.3403149Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:59.3403326Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:59.3403410Z #define __SSE__ 1 2025-05-07T20:26:59.3403508Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:59.3403611Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:59.3403696Z #define _CTYPE_H 1 2025-05-07T20:26:59.3403787Z #define __sigset_t_defined 2025-05-07T20:26:59.3403895Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:59.3404070Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:59.3404168Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:59.3404265Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:59.3404361Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:59.3404455Z #define __SM_70_RT_H__ 2025-05-07T20:26:59.3404549Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:59.3404663Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:59.3404765Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:59.3404928Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:59.3405022Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:59.3405142Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:59.3405240Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:59.3405331Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:59.3405419Z #define __amd64__ 1 2025-05-07T20:26:59.3405509Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:59.3405620Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:59.3405883Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:59.3405990Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:59.3406081Z #define EOF (-1) 2025-05-07T20:26:59.3406180Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:59.3406276Z #define __USE_POSIX199309 1 2025-05-07T20:26:59.3406384Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:59.3406486Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:59.3406581Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:59.3406685Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:59.3406800Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:59.3406895Z #define ____mbstate_t_defined 1 2025-05-07T20:26:59.3406990Z #define STA_NANO 0x2000 2025-05-07T20:26:59.3407088Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:59.3407189Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:59.3407278Z #define _IO_LINKED 0x80 2025-05-07T20:26:59.3407374Z #define __cpp_lib_launder 201606 2025-05-07T20:26:59.3407476Z #define __SIZEOF_INT128__ 16 2025-05-07T20:26:59.3407584Z 
#define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:59.3407681Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:59.3407781Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:59.3407921Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:59.3408027Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3408139Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:59.3408237Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:59.3408330Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:59.3408426Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:59.3408557Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:59.3408684Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:59.3408884Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:59.3409066Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:59.3409156Z #define __stub_stty 2025-05-07T20:26:59.3409319Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:59.3409410Z #define le16toh(x) (x) 2025-05-07T20:26:59.3409524Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:59.3409696Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:59.3409785Z #define _SIZET_ 2025-05-07T20:26:59.3409877Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:59.3410051Z #define _SVID_SOURCE 1 2025-05-07T20:26:59.3410140Z #define _LP64 1 2025-05-07T20:26:59.3410231Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:59.3410462Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:59.3410582Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:59.3410667Z #define __UINT8_C(c) c 2025-05-07T20:26:59.3410761Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:59.3410861Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:59.3410971Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:59.3411065Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:59.3411243Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:59.3411340Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:59.3411429Z #define CUDARTAPI 2025-05-07T20:26:59.3411526Z #define IOV_MAX 1024 2025-05-07T20:26:59.3411705Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:59.3411830Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:59.3411953Z #define P_tmpdir "/tmp" 2025-05-07T20:26:59.3412081Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:59.3412191Z #define __wchar_t__ 2025-05-07T20:26:59.3412321Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:59.3412422Z #define SEEK_END 2 2025-05-07T20:26:59.3412543Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:59.3412756Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:26:59.3412878Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:59.3413065Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:59.3413177Z #define ____FILE_defined 1 2025-05-07T20:26:59.3413328Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:59.3413457Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:59.3413565Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:59.3413689Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:59.3413995Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:59.3414161Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:59.3414276Z #define _IO_RIGHT 04 2025-05-07T20:26:59.3414374Z #define __END_NAMESPACE_STD 2025-05-07T20:26:59.3414558Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:59.3414661Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:59.3414781Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:59.3414883Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:59.3414984Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:59.3415068Z #define _STDDEF_H_ 2025-05-07T20:26:59.3415245Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:59.3415348Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3415467Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:59.3415674Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:59.3415786Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3415932Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:59.3416065Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:59.3416167Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:59.3416284Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:59.3416381Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:59.3416494Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:59.3416598Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:59.3416696Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:59.3416793Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:59.3416973Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:59.3417074Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:59.3417249Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:59.3417355Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:59.3417450Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:59.3417593Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:59.3417816Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:59.3417918Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:59.3418023Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:59.3418142Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:59.3418235Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:59.3418342Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:59.3418505Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:59.3418673Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:59.3418779Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:59.3418900Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:59.3419093Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:59.3419201Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:59.3419426Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:59.3419528Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:59.3419645Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:59.3419739Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:59.3419837Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:59.3419933Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:59.3420029Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:59.3420129Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:59.3420210Z #define __FXSR__ 1 2025-05-07T20:26:59.3420291Z #define _SIZE_T 2025-05-07T20:26:59.3420398Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:59.3420509Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:26:59.3420680Z #define __FLT32X_MAX__ 
1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:59.3420832Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:26:59.3420924Z #define _IO_ssize_t __ssize_t 2025-05-07T20:26:59.3421028Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:26:59.3421209Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:59.3421412Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:26:59.3421511Z #define _GXX_NULLPTR_T 2025-05-07T20:26:59.3421637Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:26:59.3421734Z #define FOPEN_MAX 16 2025-05-07T20:26:59.3421851Z #define __BIG_ENDIAN 4321 2025-05-07T20:26:59.3421997Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:59.3422123Z #define __suseconds_t_defined 2025-05-07T20:26:59.3422232Z #define __off_t_defined 2025-05-07T20:26:59.3422339Z #define stderr stderr 2025-05-07T20:26:59.3422463Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:26:59.3422603Z #define __glibcxx_requires_string(_String) 2025-05-07T20:26:59.3422729Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:26:59.3422852Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:26:59.3423355Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:26:59.3423468Z #define __mode_t_defined 2025-05-07T20:26:59.3423584Z #define _GCC_SIZE_T 2025-05-07T20:26:59.3423706Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:59.3423823Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:26:59.3423936Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:26:59.3424032Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:26:59.3424129Z #define __UINT32_C(c) c ## U 2025-05-07T20:26:59.3424233Z #define __cpp_alias_templates 200704L 2025-05-07T20:26:59.3424338Z #define cudaHostAllocMapped 0x02 2025-05-07T20:26:59.3424449Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:26:59.3424541Z #define _STL_ITERATOR_H 1 2025-05-07T20:26:59.3424627Z #define __size_t__ 2025-05-07T20:26:59.3424766Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:26:59.3424861Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:26:59.3424969Z #define cudaEventRecordExternal 0x01 2025-05-07T20:26:59.3425124Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:26:59.3425218Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:26:59.3425476Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:26:59.3425562Z #define _ENDIAN_H 1 2025-05-07T20:26:59.3425666Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:26:59.3425767Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:26:59.3425868Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:26:59.3425948Z #define __try try 2025-05-07T20:26:59.3426047Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:26:59.3426139Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:26:59.3426228Z #define __INT8_MAX__ 0x7f 2025-05-07T20:26:59.3426492Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:26:59.3426663Z #define __LONG_WIDTH__ 64 2025-05-07T20:26:59.3426746Z #define __PIC__ 2 2025-05-07T20:26:59.3426865Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:26:59.3426985Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:26:59.3427124Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:26:59.3427226Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:26:59.3427322Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:26:59.3427510Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:59.3427612Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3427714Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:26:59.3427811Z #define _IO_uid_t __uid_t 2025-05-07T20:26:59.3427908Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:26:59.3428035Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:26:59.3428134Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:26:59.3428633Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:59.3428790Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:26:59.3428913Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:26:59.3428996Z #define LONG_BIT 64 2025-05-07T20:26:59.3429161Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:26:59.3429262Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:26:59.3429394Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:26:59.3429495Z #define __fsfilcnt_t_defined 2025-05-07T20:26:59.3429586Z #define __blkcnt_t_defined 2025-05-07T20:26:59.3429855Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:59.3429953Z #define __USE_LARGEFILE 1 2025-05-07T20:26:59.3430052Z #define __cpp_constexpr 201603L 2025-05-07T20:26:59.3430146Z #define CUDART_VERSION 12080 2025-05-07T20:26:59.3430245Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:26:59.3430347Z #define cudaDeviceMapHost 0x08 2025-05-07T20:26:59.3430441Z #define _GLIBCXX_CMATH 1 2025-05-07T20:26:59.3430640Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:26:59.3430737Z #define __lldiv_t_defined 1 2025-05-07T20:26:59.3430830Z #define __SSE2__ 1 2025-05-07T20:26:59.3430913Z #define _IOLBF 1 2025-05-07T20:26:59.3431013Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:26:59.3431109Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:26:59.3431237Z #define __cpp_deduction_guides 201703L 2025-05-07T20:26:59.3431332Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:26:59.3431442Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:26:59.3431541Z #define __INT32_TYPE__ int 2025-05-07T20:26:59.3431634Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:26:59.3431746Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:26:59.3431845Z #define __cpp_exceptions 199711L 2025-05-07T20:26:59.3431941Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:26:59.3432056Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:26:59.3432148Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:26:59.3432263Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:26:59.3432439Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:26:59.3432534Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:26:59.3432629Z #define __SWORD_TYPE long int 2025-05-07T20:26:59.3432729Z #define __INTMAX_TYPE__ long int 2025-05-07T20:26:59.3432825Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:26:59.3433166Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:26:59.3433262Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:26:59.3433543Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:59.3433645Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:26:59.3433790Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:26:59.3433871Z #define _T_SIZE 2025-05-07T20:26:59.3433982Z #define cudaHostAllocDefault 0x00 2025-05-07T20:26:59.3434107Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:26:59.3434231Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:26:59.3434333Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:26:59.3434548Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:26:59.3434669Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:26:59.3434776Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3434866Z #define __ATOMIC_CONSUME 1 2025-05-07T20:26:59.3435046Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:26:59.3435141Z #define __GNUC_MINOR__ 4 2025-05-07T20:26:59.3435244Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:26:59.3435344Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:26:59.3435461Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.3435543Z #define __PIE__ 2 2025-05-07T20:26:59.3435653Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:26:59.3435751Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:26:59.3435941Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:26:59.3436163Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:59.3436262Z #define __nlink_t_defined 2025-05-07T20:26:59.3436393Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:26:59.3436504Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:26:59.3436590Z #define _XOPEN_LIM_H 1 2025-05-07T20:26:59.3436853Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:59.3436975Z #define __cpp_template_template_args 201611L 2025-05-07T20:26:59.3437078Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:26:59.3437184Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:26:59.3437277Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:26:59.3437366Z #define __FILE_defined 1 2025-05-07T20:26:59.3437551Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:26:59.3437647Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:26:59.3437747Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:26:59.3437857Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:26:59.3437975Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:26:59.3438088Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:26:59.3438191Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:26:59.3438275Z #define __INT16_C(c) c 2025-05-07T20:26:59.3438377Z #define __U32_TYPE unsigned int 2025-05-07T20:26:59.3438475Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:26:59.3438602Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:26:59.3438691Z #define __STDC__ 1 2025-05-07T20:26:59.3438787Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:26:59.3438891Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:26:59.3438987Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:26:59.3439137Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:26:59.3439233Z #define __FLT32X_DIG__ 15 2025-05-07T20:26:59.3439332Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:26:59.3439428Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:26:59.3439547Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:26:59.3439662Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:26:59.3439758Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:26:59.3439869Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:26:59.3439950Z #define stdin stdin 2025-05-07T20:26:59.3440040Z #define __ino64_t_defined 2025-05-07T20:26:59.3440132Z #define STA_CLK 0x8000 
2025-05-07T20:26:59.3440312Z #define __clockid_t_defined 1 2025-05-07T20:26:59.3440466Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:26:59.3440628Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:26:59.3440731Z #define __cudaCDP2MemsetAsync 2025-05-07T20:26:59.3440838Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:26:59.3440943Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:26:59.3441046Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:26:59.3441251Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:26:59.3441344Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:26:59.3442030Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:26:59.3442142Z #define DOMAIN 1 2025-05-07T20:26:59.3442257Z #define M_LN2 0.69314718055994530942 2025-05-07T20:26:59.3442370Z #define __NVCC__ 1 2025-05-07T20:26:59.3442498Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:26:59.3442638Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:59.3442768Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:26:59.3442893Z #define __throw_exception_again throw 2025-05-07T20:26:59.3443011Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:26:59.3443128Z #define __EXCEPTION_H 1 2025-05-07T20:26:59.3443247Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:26:59.3443376Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:26:59.3443760Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:59.3443909Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:26:59.3444042Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:26:59.3444160Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:26:59.3444289Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:26:59.3444393Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:26:59.3444539Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:26:59.3444645Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:59.3444764Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:26:59.3444857Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:26:59.3444962Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:26:59.3445065Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:26:59.3445167Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:26:59.3445310Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:26:59.3445404Z #define __useconds_t_defined 2025-05-07T20:26:59.3445504Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:26:59.3445698Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:26:59.3445845Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:26:59.3445931Z #define __SSE_MATH__ 1 2025-05-07T20:26:59.3446028Z #define _IO_wint_t wint_t 2025-05-07T20:26:59.3446124Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:26:59.3446221Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:26:59.3446323Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:26:59.3446437Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:26:59.3446542Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:26:59.3446639Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:26:59.3446723Z #define __USE_ATFILE 1 2025-05-07T20:26:59.3446821Z #define _POSIX_OPEN_MAX 20 2025-05-07T20:26:59.3446916Z #define 
_POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:26:59.3447005Z #define _GCC_PTRDIFF_T 2025-05-07T20:26:59.3447236Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:59.3447338Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:26:59.3447439Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:26:59.3447549Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:26:59.3447661Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:26:59.3447744Z #define _STDLIB_H 1 2025-05-07T20:26:59.3447888Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:26:59.3448100Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3448200Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:26:59.3448328Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.3448435Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:59.3448536Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:26:59.3448720Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:26:59.3448875Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:26:59.3448985Z #define __glibcxx_requires_nonempty() 2025-05-07T20:26:59.3449105Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:26:59.3449277Z #define __ldiv_t_defined 1 2025-05-07T20:26:59.3449461Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:26:59.3449555Z #define ___int_ptrdiff_t_h 2025-05-07T20:26:59.3449731Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:59.3449834Z #define __cudaCDP2EventDestroy 2025-05-07T20:26:59.3449934Z #define __HOST_DEFINES_H__ 2025-05-07T20:26:59.3450041Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:26:59.3450141Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:59.3450240Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:26:59.3450328Z #define CUDART_CB 2025-05-07T20:26:59.3450430Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:26:59.3450553Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:26:59.3450644Z #define MB_LEN_MAX 16 2025-05-07T20:26:59.3450868Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:59.3450972Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:26:59.3451102Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:26:59.3451217Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:26:59.3451319Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:26:59.3451468Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:26:59.3451578Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:26:59.3451678Z #define _GNU_SOURCE 1 2025-05-07T20:26:59.3451764Z #define __stub_putmsg 2025-05-07T20:26:59.3451848Z #define __CUDACC__ 1 2025-05-07T20:26:59.3451951Z #define __N(msgid) (msgid) 2025-05-07T20:26:59.3452056Z #define __P(args) args 2025-05-07T20:26:59.3452371Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:26:59.3452503Z #define __cpp_init_captures 201304L 2025-05-07T20:26:59.3452632Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:26:59.3452752Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:26:59.3452874Z #define __cpp_lib_as_const 201510 2025-05-07T20:26:59.3452982Z #define __WCHAR_T 2025-05-07T20:26:59.3453104Z #define __ATOMIC_RELEASE 3 2025-05-07T20:26:59.3453222Z #define __fsblkcnt_t_defined 2025-05-07T20:26:59.3453366Z #define __cudaCDP2EventCreateWithFlags 2025-05-07T20:26:59.3453501Z #define __DEVICE_DOUBLE_FUNCTIONS_H__ 
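For reference, a dump like the one above can be reproduced outside CI. This is a minimal sketch, assuming gcc and nvcc are on PATH; the exact invocation used by this workflow is not shown in the log, so the flags below are illustrative:

  # Host-compiler view: -dM tells the preprocessor to print every active #define.
  echo | gcc -dM -E -x c++ -

  # CUDA view: preprocess an empty .cu file and forward -dM to the host compiler
  # via -Xcompiler. This adds macros such as __NVCC__ and __CUDACC__; device-side
  # macros like __CUDA_ARCH__ may only appear during the device compilation pass.
  touch empty.cu
  nvcc -E -Xcompiler -dM empty.cu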
2025-05-07T20:26:59.3613176Z + conda run -n build_binary nvcc --version
2025-05-07T20:27:01.2597750Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:27:01.2598180Z Copyright (c) 2005-2025 NVIDIA Corporation
2025-05-07T20:27:01.2598498Z Built on Wed_Jan_15_19:20:09_PST_2025
2025-05-07T20:27:01.2598808Z Cuda compilation tools, release 12.8, V12.8.61
2025-05-07T20:27:01.2599139Z Build cuda_12.8.r12.8/compiler.35404655_0
2025-05-07T20:27:01.3239982Z /usr/bin/nvidia-smi
2025-05-07T20:27:01.3244929Z + nvidia-smi
2025-05-07T20:27:01.3417608Z Wed May  7 20:27:01 2025
2025-05-07T20:27:01.3418030Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:01.3418538Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:27:01.3419023Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:01.3419832Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:27:01.3420358Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:27:01.3420778Z |                                         |                        |               MIG M. |
2025-05-07T20:27:01.3421116Z |=========================================+========================+======================|
2025-05-07T20:27:01.3590816Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:27:01.3591255Z |  0%   28C    P8             22W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:27:01.3591826Z |                                         |                        |                  N/A |
2025-05-07T20:27:01.3592219Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:01.3596006Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:01.3596429Z | Processes:                                                                              |
2025-05-07T20:27:01.3596866Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:27:01.3597279Z |        ID   ID                                                               Usage      |
2025-05-07T20:27:01.3597625Z |=========================================================================================|
2025-05-07T20:27:01.3600578Z |  No running processes found                                                             |
2025-05-07T20:27:01.3601053Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:01.6319553Z [INSTALL] Successfully installed CUDA 12.8.0
2025-05-07T20:27:01.6370570Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:01.6371115Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:01.6396141Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:27:01.6396501Z env:
2025-05-07T20:27:01.6396738Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:27:01.6397037Z   BUILD_ENV: build_binary
2025-05-07T20:27:01.6397284Z   BUILD_TARGET: genai
2025-05-07T20:27:01.6397516Z   BUILD_VARIANT: cuda
2025-05-07T20:27:01.6397747Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:27:01.6398010Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:27:01.6398333Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:27:01.6398664Z ##[endgroup]
2025-05-07T20:27:01.9773707Z ################################################################################
2025-05-07T20:27:01.9774414Z # Install PyTorch (PIP)
2025-05-07T20:27:01.9774871Z #
2025-05-07T20:27:01.9790249Z # [2025-05-07T20:27:01.978Z] + install_pytorch_pip build_binary nightly cuda/12.8.0
2025-05-07T20:27:01.9790689Z ################################################################################
2025-05-07T20:27:01.9821246Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:27:02.9809535Z Channels:
2025-05-07T20:27:02.9809879Z  - conda-forge
2025-05-07T20:27:02.9810191Z Platform: linux-64
2025-05-07T20:27:06.2453542Z Collecting package metadata (repodata.json): done
2025-05-07T20:27:06.9634574Z Solving environment: done
2025-05-07T20:27:07.1865294Z ## Package Plan ##
2025-05-07T20:27:07.1865810Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:27:07.1866290Z   added / updated specs:
2025-05-07T20:27:07.1866532Z     - numpy
2025-05-07T20:27:07.1866822Z The following packages will be downloaded:
2025-05-07T20:27:07.1867153Z     package                    |            build
2025-05-07T20:27:07.1867473Z     ---------------------------|-----------------
2025-05-07T20:27:07.1867850Z     libblas-3.9.0              | 31_h59b9bed_openblas       16 KB  conda-forge
2025-05-07T20:27:07.1868463Z     libcblas-3.9.0             | 31_he106b2a_openblas       16 KB  conda-forge
2025-05-07T20:27:07.1869162Z     libgfortran-15.1.0         | h69a702a_2                 34 KB  conda-forge
2025-05-07T20:27:07.1869717Z     libgfortran5-15.1.0        | hcea5267_2                1.5 MB  conda-forge
2025-05-07T20:27:07.1870573Z     liblapack-3.9.0            | 31_h7ac8fdf_openblas       16 KB  conda-forge
2025-05-07T20:27:07.1871042Z     libopenblas-0.3.29         | pthreads_h94d23a6_0       5.6 MB  conda-forge
2025-05-07T20:27:07.1871491Z     numpy-2.2.5                | py311h5d046bc_0           8.6 MB  conda-forge
2025-05-07T20:27:07.1871871Z     ------------------------------------------------------------
2025-05-07T20:27:07.1872216Z                                                   Total:  15.9 MB
2025-05-07T20:27:07.1872559Z The following NEW packages will be INSTALLED:
2025-05-07T20:27:07.1873003Z   libblas        conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:27:07.1873498Z   libcblas       conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:27:07.1874001Z   libgfortran    conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:27:07.1874508Z   libgfortran5   conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:27:07.1875040Z   liblapack      conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:27:07.1875598Z   libopenblas    conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:27:07.1876311Z   numpy          conda-forge/linux-64::numpy-2.2.5-py311h5d046bc_0
2025-05-07T20:27:07.1876735Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:27:08.2177840Z Preparing transaction: done
2025-05-07T20:27:08.3180344Z Verifying transaction: done
2025-05-07T20:27:08.4188944Z Executing transaction: done
2025-05-07T20:27:08.5961469Z ################################################################################
2025-05-07T20:27:08.5961879Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:08.5962181Z #
2025-05-07T20:27:08.5979560Z # [2025-05-07T20:27:08.597Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:27:08.5980035Z ################################################################################
2025-05-07T20:27:08.5995713Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:08.6893570Z [CHECK] Network does not appear to be blocked.
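The [INSTALL] lines that follow show __prepare_pip_arguments turning the requested channel (nightly) and CUDA variant (cuda/12.8.0) into the cu128 index URL. A minimal sketch of that mapping, assuming the real logic in .github/scripts/setup_env.bash may differ in detail:

    # Derive the cu### variant tag and the PyTorch PIP index URL from a
    # CUDA version string, mirroring the values logged below.
    channel="nightly"
    cuda_version="12.8.0"
    variant="cu$(echo "${cuda_version}" | cut -d. -f1-2 | tr -d .)"   # -> cu128
    echo "https://download.pytorch.org/whl/${channel}/${variant}/"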
2025-05-07T20:27:08.6894567Z ################################################################################ 2025-05-07T20:27:08.6895302Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:08.6895631Z # 2025-05-07T20:27:08.6913482Z # [2025-05-07T20:27:08.690Z] + __prepare_pip_arguments torch nightly cuda/12.8.0 2025-05-07T20:27:08.6913926Z ################################################################################ 2025-05-07T20:27:08.6914140Z 2025-05-07T20:27:08.6934361Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:08.6960113Z [INSTALL] Extracted package variant: cu128 2025-05-07T20:27:08.6976932Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:08.6977526Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:27:08.6986198Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:08.6995131Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ... 2025-05-07T20:27:08.7016269Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:06.7934620Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:06.7935201Z Collecting torch 2025-05-07T20:28:06.7936087Z Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:06.7937015Z Collecting filelock (from torch) 2025-05-07T20:28:06.7937517Z Using cached https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:06.7938434Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from torch) (4.13.2) 2025-05-07T20:28:06.7939152Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:06.7939699Z Using cached https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:06.7940207Z Collecting networkx (from torch) 2025-05-07T20:28:06.7940700Z Using cached https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:28:06.7941228Z Collecting jinja2 (from torch) 2025-05-07T20:28:06.7941694Z Using cached https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:06.7942197Z Collecting fsspec (from torch) 2025-05-07T20:28:06.7942680Z Using cached https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:28:06.7943243Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch) 2025-05-07T20:28:06.7944057Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:06.7945318Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch) 2025-05-07T20:28:06.7946137Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:06.7946946Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch) 2025-05-07T20:28:06.7947744Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:06.7948525Z Collecting 
nvidia-cudnn-cu12==9.8.0.87 (from torch) 2025-05-07T20:28:06.7949422Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB) 2025-05-07T20:28:06.7950119Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch) 2025-05-07T20:28:06.7950820Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:06.7951527Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch) 2025-05-07T20:28:06.7952301Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:06.7953247Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch) 2025-05-07T20:28:06.7953952Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:06.7954662Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch) 2025-05-07T20:28:06.7955374Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:06.7956084Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch) 2025-05-07T20:28:06.7956884Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:06.7957684Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:06.7958404Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB) 2025-05-07T20:28:06.7959096Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:06.7959850Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:06.7960605Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch) 2025-05-07T20:28:06.7961355Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:06.7962127Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch) 2025-05-07T20:28:06.7962925Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:06.7963716Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch) 2025-05-07T20:28:06.7964496Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:06.7965296Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:06.7966124Z Using cached https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:06.7967380Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1) 2025-05-07T20:28:06.7968216Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:06.7968862Z Using cached 
https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:06.7969394Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:06.7970093Z Using cached https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB) 2025-05-07T20:28:06.7971130Z Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp311-cp311-manylinux_2_28_x86_64.whl (1047.1 MB) 2025-05-07T20:28:06.7972141Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB) 2025-05-07T20:28:06.7973201Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB) 2025-05-07T20:28:06.7974339Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB) 2025-05-07T20:28:06.7975477Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB) 2025-05-07T20:28:06.7977122Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB) 2025-05-07T20:28:06.7978170Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB) 2025-05-07T20:28:06.7979285Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB) 2025-05-07T20:28:06.7980326Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 MB) 2025-05-07T20:28:06.7981294Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB) 2025-05-07T20:28:06.7982363Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB) 2025-05-07T20:28:06.7983433Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:28:06.7984475Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:28:06.7985588Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 MB) 2025-05-07T20:28:06.7986692Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB) 2025-05-07T20:28:06.7987815Z Using cached https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:28:06.7990047Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, 
nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:28:06.7991636Z 2025-05-07T20:28:06.7993567Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128 2025-05-07T20:28:06.7995686Z 2025-05-07T20:28:09.0236981Z torch 2.8.0.dev20250507+cu128 2025-05-07T20:28:09.0239377Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128) 2025-05-07T20:28:12.4288830Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:15.8668734Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128 2025-05-07T20:28:15.8669531Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:19.1963654Z True 2025-05-07T20:28:19.1963921Z True 2025-05-07T20:28:19.1964033Z 2025-05-07T20:28:19.2588801Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:19.2626114Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:19.2626718Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:19.2639042Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:19.2639390Z env: 2025-05-07T20:28:19.2639615Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:19.2639911Z BUILD_ENV: build_binary 2025-05-07T20:28:19.2640153Z BUILD_TARGET: genai 2025-05-07T20:28:19.2640381Z BUILD_VARIANT: cuda 2025-05-07T20:28:19.2640615Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:19.2640861Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:19.2641161Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:19.2641496Z ##[endgroup] 2025-05-07T20:28:19.6014864Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:19.6016456Z ################################################################################ 2025-05-07T20:28:19.6017075Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:19.6017467Z # 2025-05-07T20:28:19.6032660Z # [2025-05-07T20:28:19.602Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:19.6033160Z ################################################################################ 2025-05-07T20:28:19.6033377Z 2025-05-07T20:28:19.6048197Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:19.6981790Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:19.6990266Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:19.6991227Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:19.6991637Z 2025-05-07T20:28:19.8028662Z 2025-05-07T20:28:19.8029547Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:19.8054985Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:25.5706496Z Collecting environment information... 
2025-05-07T20:28:25.5706930Z PyTorch version: 2.8.0.dev20250507+cu128 2025-05-07T20:28:25.5707252Z Is debug build: False 2025-05-07T20:28:25.5707512Z CUDA used to build PyTorch: 12.8 2025-05-07T20:28:25.5707791Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:25.5707971Z 2025-05-07T20:28:25.5708078Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:25.5708407Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:25.5708724Z Clang version: Could not collect 2025-05-07T20:28:25.5709062Z CMake version: Could not collect 2025-05-07T20:28:25.5709341Z Libc version: glibc-2.34 2025-05-07T20:28:25.5709497Z 2025-05-07T20:28:25.5709797Z Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:25.5710417Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:25.5710837Z Is CUDA available: True 2025-05-07T20:28:25.5711103Z CUDA runtime version: 12.8.61 2025-05-07T20:28:25.5711373Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:25.5711683Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:25.5712284Z Nvidia driver version: 570.133.07 2025-05-07T20:28:25.5712560Z cuDNN version: Could not collect 2025-05-07T20:28:25.5712834Z HIP runtime version: N/A 2025-05-07T20:28:25.5713092Z MIOpen runtime version: N/A 2025-05-07T20:28:25.5713350Z Is XNNPACK available: True 2025-05-07T20:28:25.5713514Z 2025-05-07T20:28:25.5713593Z CPU: 2025-05-07T20:28:25.5713814Z Architecture: x86_64 2025-05-07T20:28:25.5714151Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:25.5714533Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:25.5714924Z Byte Order: Little Endian 2025-05-07T20:28:25.5715245Z CPU(s): 16 2025-05-07T20:28:25.5715536Z On-line CPU(s) list: 0-15 2025-05-07T20:28:25.5716050Z Vendor ID: AuthenticAMD 2025-05-07T20:28:25.5716397Z Model name: AMD EPYC 7R32 2025-05-07T20:28:25.5716718Z CPU family: 23 2025-05-07T20:28:25.5717017Z Model: 49 2025-05-07T20:28:25.5717305Z Thread(s) per core: 2 2025-05-07T20:28:25.5717592Z Core(s) per socket: 8 2025-05-07T20:28:25.5717879Z Socket(s): 1 2025-05-07T20:28:25.5718165Z Stepping: 0 2025-05-07T20:28:25.5718466Z BogoMIPS: 5599.99 2025-05-07T20:28:25.5720488Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:25.5722516Z Hypervisor vendor: KVM 2025-05-07T20:28:25.5722835Z Virtualization type: full 2025-05-07T20:28:25.5723177Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:25.5723541Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:25.5723911Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:25.5724265Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:25.5724590Z NUMA node(s): 1 2025-05-07T20:28:25.5724879Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:25.5725250Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:25.5725628Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:25.5725986Z Vulnerability L1tf: Not affected 2025-05-07T20:28:25.5726329Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:25.5726684Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:25.5727043Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:25.5727401Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:25.5727938Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:25.5728699Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:25.5729238Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:25.5729908Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:25.5730755Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:25.5731427Z Vulnerability Srbds: Not affected 2025-05-07T20:28:25.5731918Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:25.5732147Z 2025-05-07T20:28:25.5732252Z Versions of relevant libraries: 2025-05-07T20:28:25.5732520Z [pip3] numpy==2.2.5 2025-05-07T20:28:25.5732766Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T20:28:25.5733070Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T20:28:25.5733383Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T20:28:25.5733697Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T20:28:25.5734006Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T20:28:25.5734297Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T20:28:25.5734589Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T20:28:25.5734890Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T20:28:25.5735192Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T20:28:25.5735649Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:25.5735954Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:25.5736239Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T20:28:25.5736548Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T20:28:25.5736838Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:25.5737137Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T20:28:25.5737508Z [conda] cuda-cudart 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:25.5737993Z [conda] cuda-cudart-dev 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:25.5738495Z [conda] cuda-cudart-dev_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:25.5739012Z [conda] cuda-cudart-static 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:25.5739544Z [conda] cuda-cudart-static_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:25.5740073Z [conda] cuda-cudart_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:25.5740548Z [conda] cuda-cupti 12.8.57 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5741014Z [conda] cuda-cupti-dev 12.8.57 h5888daf_0 conda-forge 2025-05-07T20:28:25.5741500Z [conda] cuda-libraries 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:25.5741994Z [conda] cuda-libraries-dev 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:25.5742461Z [conda] cuda-nvrtc 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5742931Z [conda] cuda-nvrtc-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:25.5743390Z [conda] cuda-nvtx 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5743835Z [conda] cuda-opencl 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5744309Z [conda] cuda-opencl-dev 12.8.55 h5888daf_0 conda-forge 2025-05-07T20:28:25.5744785Z [conda] cuda-runtime 12.8.0 ha804496_0 conda-forge 2025-05-07T20:28:25.5745243Z [conda] libcublas 12.8.3.14 h9ab20c4_0 conda-forge 
2025-05-07T20:28:25.5745705Z [conda] libcublas-dev 12.8.3.14 h9ab20c4_0 conda-forge 2025-05-07T20:28:25.5746169Z [conda] libcufft 11.3.3.41 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5746626Z [conda] libcufft-dev 11.3.3.41 h5888daf_0 conda-forge 2025-05-07T20:28:25.5747079Z [conda] libcurand 10.3.9.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5747546Z [conda] libcurand-dev 10.3.9.55 h5888daf_0 conda-forge 2025-05-07T20:28:25.5748018Z [conda] libcusolver 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:25.5748496Z [conda] libcusolver-dev 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:25.5749030Z [conda] libcusparse 12.5.7.53 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5749519Z [conda] libcusparse-dev 12.5.7.53 h5888daf_0 conda-forge 2025-05-07T20:28:25.5749998Z [conda] libnvjitlink 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5750563Z [conda] libnvjitlink-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:25.5751019Z [conda] numpy 2.2.5 py311h5d046bc_0 conda-forge 2025-05-07T20:28:25.5751477Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T20:28:25.5751975Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:25.5752463Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:25.5752961Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:25.5753448Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T20:28:25.5753998Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T20:28:25.5754475Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T20:28:25.5754959Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T20:28:25.5755444Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T20:28:25.5755932Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:25.5756412Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:25.5756885Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:25.5757353Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T20:28:25.5757823Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:25.5758277Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T20:28:25.5758543Z 2025-05-07T20:28:25.6461589Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:25.6462251Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:25.6474156Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:25.6474501Z env: 2025-05-07T20:28:25.6474730Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:25.6475024Z BUILD_ENV: build_binary 2025-05-07T20:28:25.6475272Z BUILD_TARGET: genai 2025-05-07T20:28:25.6475502Z BUILD_VARIANT: cuda 2025-05-07T20:28:25.6475739Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:25.6475996Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:25.6476296Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:25.6476628Z ##[endgroup] 2025-05-07T20:28:25.9878594Z ################################################################################ 2025-05-07T20:28:25.9878983Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:25.9879229Z # 2025-05-07T20:28:25.9895081Z # [2025-05-07T20:28:25.989Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:25.9895484Z ################################################################################ 2025-05-07T20:28:25.9895706Z 2025-05-07T20:28:25.9910519Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:26.0777919Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:26.0798012Z [BUILD] Running git submodules update ... 2025-05-07T20:28:26.0818571Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:26.1187007Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:26.1187668Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:26.1188305Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:26.1188861Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:26.1189415Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:26.1189868Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:26.1190273Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:26.1222837Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:26.1774816Z [BUILD] Installing other build dependencies ... 
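The next step installs the build dependencies pinned in requirements.txt. Once it completes, a quick hedged spot-check (not part of the logged job; the module names are assumptions based on the packages listed below) would be:

    # Verify that the key build dependencies resolved into the conda env.
    conda run -n build_binary python -c \
        "import skbuild, ninja, cmake, yaml; print('build deps importable')"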
2025-05-07T20:28:26.1796946Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:28.5437980Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:28.5452302Z Using cached backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:28.5793515Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:28.5805857Z Using cached build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:28.7238216Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:28.7252981Z Using cached cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:28.7641054Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:28.7654453Z Using cached click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:28.9923744Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:28.9937951Z Using cached hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:29.0023211Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:29.0025643Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:29.0480751Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:29.0493524Z Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:29.0506362Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:29.0826733Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:29.0839198Z Using cached pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:29.1371817Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:29.1385075Z Using cached PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:29.1709963Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:29.1721708Z Using cached scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:29.1768737Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:29.2125034Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:29.2137672Z Using cached setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:29.2462015Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:29.2473240Z Using cached tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:29.2865102Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:29.2877351Z Using cached patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:29.3272266Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:29.3283751Z Using cached packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:29.3569733Z Collecting pyproject_hooks (from build->-r 
requirements.txt (line 14)) 2025-05-07T20:28:29.3581299Z Using cached pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:29.3912809Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:29.3924832Z Using cached attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:29.4297226Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:29.4309802Z Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:29.4334715Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:29.4640339Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:29.4652304Z Using cached typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:29.4666593Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:29.4949767Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:29.4961674Z Using cached distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:29.4982098Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:29.5393878Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:29.5405714Z Using cached mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:29.5437473Z Using cached backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:29.5449528Z Using cached build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:29.5461777Z Using cached cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:29.5680931Z Using cached click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:29.5693495Z Using cached hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:29.5709448Z Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:29.5721491Z Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:29.5736867Z Using cached pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:29.5749063Z Using cached PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (762 kB) 2025-05-07T20:28:29.5765953Z Using cached scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:29.5778390Z Using cached setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:29.5790201Z Using cached tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:29.5802472Z Using cached patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:29.5817726Z Using cached attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:29.5831016Z Using cached packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:29.5843576Z Using cached distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:29.5855653Z Using cached pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:29.5867413Z Using cached typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:29.5879448Z Using 
cached mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:29.7222367Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:32.0352978Z 2025-05-07T20:28:32.0380830Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:32.2123727Z ################################################################################ 2025-05-07T20:28:32.2124071Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:32.2124523Z # 2025-05-07T20:28:32.2143971Z # [2025-05-07T20:28:32.214Z] + install_triton_pip build_binary 2025-05-07T20:28:32.2144351Z ################################################################################ 2025-05-07T20:28:32.2144562Z 2025-05-07T20:28:32.2144786Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:32.2154166Z ################################################################################ 2025-05-07T20:28:32.2154584Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:32.2154913Z # 2025-05-07T20:28:32.2161771Z # [2025-05-07T20:28:32.215Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:32.2162295Z ################################################################################ 2025-05-07T20:28:32.2162516Z 2025-05-07T20:28:32.2180858Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:32.3117139Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:32.3117497Z ################################################################################ 2025-05-07T20:28:32.3117840Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:32.3118124Z # 2025-05-07T20:28:32.3136629Z # [2025-05-07T20:28:32.313Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:32.3137106Z ################################################################################ 2025-05-07T20:28:32.3137326Z 2025-05-07T20:28:32.3187245Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:32.3203360Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:32.3203973Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:32.3212631Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:32.3221735Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:32.3242682Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:37.0806651Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 
2025-05-07T20:28:37.0807853Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:37.0808484Z 2025-05-07T20:28:37.0808694Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:37.0809108Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:37.0809932Z Using cached https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:37.0811138Z Using cached https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:28:37.0811911Z Installing collected packages: pytorch-triton 2025-05-07T20:28:37.0812257Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:37.0812642Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:37.0813053Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:37.0813472Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:37.0813907Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:37.0814162Z 2025-05-07T20:28:39.2932409Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:39.2936258Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:41.4425229Z ################################################################################ 2025-05-07T20:28:41.4426227Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:41.4426743Z ################################################################################ 2025-05-07T20:28:41.4427043Z 2025-05-07T20:28:43.5094813Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:45.6223673Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:28:45.6228054Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:45.6272978Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:45.6273460Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:45.6285277Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:45.6285625Z env: 2025-05-07T20:28:45.6285847Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:45.6286154Z BUILD_ENV: build_binary 2025-05-07T20:28:45.6286402Z BUILD_TARGET: genai 2025-05-07T20:28:45.6286626Z BUILD_VARIANT: cuda 2025-05-07T20:28:45.6286869Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:45.6287146Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:45.6287449Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:45.6287779Z ##[endgroup] 2025-05-07T20:28:45.9629020Z ################################################################################ 2025-05-07T20:28:45.9629413Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:45.9629681Z # 2025-05-07T20:28:45.9649264Z # [2025-05-07T20:28:45.964Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9649902Z ################################################################################ 2025-05-07T20:28:45.9650123Z 2025-05-07T20:28:45.9650482Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9651172Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9651515Z 2025-05-07T20:28:45.9801070Z c326345df354c6141153099e3e50ba8d6de34fcb fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9803916Z 2025-05-07T20:28:45.9804414Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9804775Z 2025-05-07T20:28:45.9972989Z 9f4154b2f6c41ae40824604f2980de212f6e65550128fe52cae1c9c75e71312b fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9975438Z 2025-05-07T20:28:45.9975810Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9976154Z 2025-05-07T20:28:46.0304705Z 1c01cd21bdf738277ab20dc3f0582ce3 fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.0307032Z 2025-05-07T20:28:46.0316687Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:46.0338823Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:48.8019364Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:48.8020374Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:28:48.8021213Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:28:48.8021663Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:28:48.8021936Z 2025-05-07T20:28:55.6336692Z ################################################################################ 2025-05-07T20:28:55.6337127Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:28:55.6337520Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128 2025-05-07T20:28:55.6337949Z [CHECK] CUDA version reported by PyTorch is: 12.8 2025-05-07T20:28:55.6338252Z [CHECK] 2025-05-07T20:28:55.6338576Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:28:55.6339459Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:28:55.6339852Z ################################################################################ 2025-05-07T20:28:55.6340065Z 2025-05-07T20:28:55.6340187Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:28:59.5479198Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:03.4698316Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:07.3979634Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:07.3982529Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:19.1487861Z ################################################################################ 2025-05-07T20:29:19.1488275Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:19.1488621Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:19.1488972Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:19.1489351Z ################################################################################ 2025-05-07T20:29:19.1489570Z 2025-05-07T20:29:26.9951077Z ################################################################################ 2025-05-07T20:29:26.9951558Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:26.9953133Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:26.9955187Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:26.9955712Z ################################################################################ 2025-05-07T20:29:26.9955945Z 2025-05-07T20:29:26.9956098Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:29:30.9208158Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:29:34.8392907Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:29:38.8872111Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:29:42.8069664Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:29:42.8073862Z [INSTALL] Check for operator registrations ... 
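The operator-registration check that follows imports fbgemm_gpu and resolves each operator on torch.ops. A hedged one-liner equivalent (the operator name is taken from the log; the helper in setup_env.bash may probe more than this):

    # Resolving the op raises AttributeError if the registration is missing.
    conda run -n build_binary python -c \
        "import fbgemm_gpu, torch; print(torch.ops.fbgemm.nccl_init)"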
2025-05-07T20:29:46.6524445Z fbgemm.nccl_init 2025-05-07T20:29:46.6524629Z 2025-05-07T20:29:46.7142596Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:29:50.5579860Z fbgemm.gqa_attn_splitk 2025-05-07T20:29:50.5580074Z 2025-05-07T20:29:50.6201289Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:29:54.4748396Z fbgemm.rope_qkv_decoding 2025-05-07T20:29:54.4748627Z 2025-05-07T20:29:54.5365298Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:29:54.5365913Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:29:54.5402579Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:54.5403049Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:54.5415043Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:54.5415420Z env: 2025-05-07T20:29:54.5415655Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:54.5415949Z BUILD_ENV: build_binary 2025-05-07T20:29:54.5416196Z BUILD_TARGET: genai 2025-05-07T20:29:54.5416428Z BUILD_VARIANT: cuda 2025-05-07T20:29:54.5416660Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:54.5416921Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:54.5417229Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:54.5417756Z ##[endgroup] 2025-05-07T20:29:54.8770361Z ################################################################################ 2025-05-07T20:29:54.8770741Z # Test All FBGEMM-GPU Modules 2025-05-07T20:29:54.8771009Z # 2025-05-07T20:29:54.8787202Z # [2025-05-07T20:29:54.878Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:29:54.8787615Z ################################################################################ 2025-05-07T20:29:54.8787830Z 2025-05-07T20:30:02.7393642Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:02.7394760Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:02.7395541Z [TEST] Determined the test directories: 2025-05-07T20:30:02.7396155Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:02.7396753Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:02.7397344Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:02.7397711Z 2025-05-07T20:30:02.7400071Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:02.7406325Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:02.7406902Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:02.7407183Z 2025-05-07T20:30:03.1677895Z 2025-05-07T20:30:03.1678314Z [TEST] Installing PyTest ... 
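The [EXEC] [ATTEMPT 0/3] markers on the records below come from a retry wrapper in the setup_env.bash prelude. A hypothetical Python rendering of the same idea (the real helper is a bash function, so the name, backoff, and attempt count here are illustrative only):

import subprocess
import time

def run_with_retries(cmd: list[str], max_attempts: int = 3, delay_s: float = 30.0) -> None:
    # Prints an "[EXEC] [ATTEMPT i/3]" marker before each try, like the log lines below.
    for attempt in range(max_attempts):
        print(f"[EXEC] [ATTEMPT {attempt}/{max_attempts}] + {' '.join(cmd)}")
        try:
            subprocess.run(cmd, check=True)
            return
        except subprocess.CalledProcessError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            time.sleep(delay_s)

run_with_retries(["conda", "install", "-n", "build_binary", "-c", "conda-forge",
                  "--override-channels", "-y", "pytest", "expecttest"])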
2025-05-07T20:30:03.1701447Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:30:04.2812978Z Channels:
2025-05-07T20:30:04.2813236Z  - conda-forge
2025-05-07T20:30:04.2813464Z Platform: linux-64
2025-05-07T20:30:07.5677983Z Collecting package metadata (repodata.json): done
2025-05-07T20:30:08.7148052Z Solving environment: done
2025-05-07T20:30:08.9467162Z ## Package Plan ##
2025-05-07T20:30:08.9467672Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:30:08.9468208Z   added / updated specs:
2025-05-07T20:30:08.9468533Z     - expecttest
2025-05-07T20:30:08.9468810Z     - pytest
2025-05-07T20:30:08.9469170Z The following packages will be downloaded:
2025-05-07T20:30:08.9469516Z     package                    |            build
2025-05-07T20:30:08.9469838Z     ---------------------------|-----------------
2025-05-07T20:30:08.9470232Z     colorama-0.4.6             |     pyhd8ed1ab_1          26 KB  conda-forge
2025-05-07T20:30:08.9470692Z     exceptiongroup-1.2.2       |     pyhd8ed1ab_1          20 KB  conda-forge
2025-05-07T20:30:08.9471159Z     expecttest-0.3.0           |     pyhd8ed1ab_0          14 KB  conda-forge
2025-05-07T20:30:08.9471596Z     iniconfig-2.0.0            |     pyhd8ed1ab_1          11 KB  conda-forge
2025-05-07T20:30:08.9472021Z     packaging-25.0             |     pyh29332c3_1          61 KB  conda-forge
2025-05-07T20:30:08.9472456Z     pluggy-1.5.0               |     pyhd8ed1ab_1          23 KB  conda-forge
2025-05-07T20:30:08.9472866Z     pytest-8.3.5               |     pyhd8ed1ab_0         254 KB  conda-forge
2025-05-07T20:30:08.9473746Z     tomli-2.2.1                |     pyhd8ed1ab_1          19 KB  conda-forge
2025-05-07T20:30:08.9474397Z     ------------------------------------------------------------
2025-05-07T20:30:08.9474840Z                                            Total:         428 KB
2025-05-07T20:30:08.9475241Z The following NEW packages will be INSTALLED:
2025-05-07T20:30:08.9475839Z   colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:30:08.9476438Z   exceptiongroup     conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:30:08.9477033Z   expecttest         conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:30:08.9477648Z   iniconfig          conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:30:08.9478374Z   packaging          conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:30:08.9478919Z   pluggy             conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:30:08.9489223Z   pytest             conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:30:08.9489665Z   tomli              conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:30:08.9490079Z Downloading and Extracting Packages: ...working...
pytest-8.3.5         | 254 KB | ########## | 100%
packaging-25.0       |  61 KB | ########## | 100%
colorama-0.4.6       |  26 KB | ########## | 100%
pluggy-1.5.0         |  23 KB | ########## | 100%
exceptiongroup-1.2.2 |  20 KB | ########## | 100%
tomli-2.2.1          |  19 KB | ########## | 100%
expecttest-0.3.0     |  14 KB | ########## | 100%
iniconfig-2.0.0      |  11 KB | ########## | 100%
2025-05-07T20:30:09.3567374Z done
2025-05-07T20:30:09.4568635Z Preparing transaction: done
2025-05-07T20:30:09.5573849Z Verifying transaction: done
2025-05-07T20:31:11.3606171Z Executing transaction: done
2025-05-07T20:30:11.4879730Z [TEST] Checking imports ...
2025-05-07T20:30:15.3987087Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:15.3999704Z [TEST] Setting feature flags ...
2025-05-07T20:30:15.4000153Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:15.4000514Z 2025-05-07T20:30:15.8246650Z 2025-05-07T20:30:15.8247221Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:15.8248558Z ################################################################################ 2025-05-07T20:30:15.8249023Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:15.8249333Z # 2025-05-07T20:30:15.8269033Z # [2025-05-07T20:30:15.826Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:15.8269460Z ################################################################################ 2025-05-07T20:30:15.8269673Z 2025-05-07T20:30:15.8277139Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:15.8306019Z ./attention/gqa_test.py 2025-05-07T20:30:15.8306330Z ./coalesce/coalesce_test.py 2025-05-07T20:30:15.8306725Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:15.8307020Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:15.8307317Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:15.8307587Z ./moe/activation_test.py 2025-05-07T20:30:15.8307851Z ./moe/gather_scatter_test.py 2025-05-07T20:30:15.8308101Z ./moe/layers_test.py 2025-05-07T20:30:15.8308341Z ./moe/shuffling_test.py 2025-05-07T20:30:15.8308589Z ./quantize/quantize_test.py 2025-05-07T20:30:15.8308750Z 2025-05-07T20:30:15.8308874Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:15.8309161Z 2025-05-07T20:30:15.8327068Z ################################################################################ 2025-05-07T20:30:15.8342479Z # [2025-05-07T20:30:15.833Z] Run Python Test Suite: 2025-05-07T20:30:15.8342847Z # ./attention/gqa_test.py 2025-05-07T20:30:15.8343187Z ################################################################################ 2025-05-07T20:30:15.8366400Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:15.8367005Z 2025-05-07T20:30:18.3484775Z ============================= test session starts ============================== 2025-05-07T20:30:18.3485464Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:18.3485980Z cachedir: .pytest_cache 2025-05-07T20:30:18.3486875Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:18.3487610Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:18.3488012Z plugins: hypothesis-6.131.14 2025-05-07T20:30:20.0811491Z collecting ... 
collected 2 items

2025-05-07T20:30:57.9416794Z attention/gqa_test.py::Int4GQATest::test_gqa
Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=23, MAX_T=33, N_H_L=68)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=52, N_H_L=67)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=57, MAX_T=45, N_H_L=120)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=52, MAX_T=42, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=77, MAX_T=95, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=113, MAX_T=48, N_H_L=96)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=51, MAX_T=61, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=113, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=65, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=108, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=14, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=6)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=70, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=78, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=94)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=41, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=105)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=95, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=43, N_H_L=43)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=21, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=42, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=74, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=104, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=69, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=69, N_H_L=69)
2025-05-07T20:30:57.9511244Z PASSED
2025-05-07T20:30:57.9726961Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
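The long "Trying example" listing above is Hypothesis replaying its 'ci' profile (database=None, deadline=None, print_blob=True, derandomize=True, as shown in the session header), so the example sequence is deterministic across runs. A skeletal reconstruction of what such a test looks like; the strategies and bounds here are guesses for illustration, not the actual ones in gqa_test.py:

from hypothesis import HealthCheck, given, settings, strategies as st

settings.register_profile(
    "ci",
    database=None,
    deadline=None,
    print_blob=True,
    derandomize=True,  # fixed example order, matching the listing above
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")

@given(
    int4_kv=st.booleans(),
    num_groups=st.sampled_from([1, 4]),            # guessed; real ranges live in gqa_test.py
    B=st.integers(min_value=1, max_value=128),
    MAX_T=st.integers(min_value=4, max_value=128),
    N_H_L=st.integers(min_value=1, max_value=128),
)
def test_gqa(int4_kv, num_groups, B, MAX_T, N_H_L):
    ...  # build K/V caches with these shapes and compare gqa_attn_splitk against a reference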
2025-05-07T20:30:57.9727286Z 2025-05-07T20:30:57.9727436Z =========================== short test summary info ============================ 2025-05-07T20:30:57.9728710Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when CUDA is not available or xformers is not available 2025-05-07T20:30:57.9730470Z ======================== 1 passed, 1 skipped in 40.11s ========================= 2025-05-07T20:30:58.6306098Z 2025-05-07T20:30:58.6306710Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:30:58.6327356Z [TEST] Python test time for ./attention/gqa_test.py: 43 seconds 2025-05-07T20:30:58.6327648Z 2025-05-07T20:30:58.6327653Z 2025-05-07T20:30:58.6327657Z 2025-05-07T20:30:58.6327660Z 2025-05-07T20:30:58.6348252Z ################################################################################ 2025-05-07T20:30:58.6363747Z # [2025-05-07T20:30:58.636Z] Run Python Test Suite: 2025-05-07T20:30:58.6364086Z # ./coalesce/coalesce_test.py 2025-05-07T20:30:58.6364379Z ################################################################################ 2025-05-07T20:30:58.6389941Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:30:58.6390573Z 2025-05-07T20:31:00.7790120Z ============================= test session starts ============================== 2025-05-07T20:31:00.7790744Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:00.7791270Z cachedir: .pytest_cache 2025-05-07T20:31:00.7791847Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:00.7792573Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:00.7792976Z plugins: hypothesis-6.131.14 2025-05-07T20:31:02.4494177Z collecting ... 
collected 1 item 2025-05-07T20:31:02.4494384Z 2025-05-07T20:31:03.1773600Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:03.1773961Z 2025-05-07T20:31:03.1774112Z ============================== 1 passed in 2.52s =============================== 2025-05-07T20:31:03.8179847Z 2025-05-07T20:31:03.8180494Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:03.8202568Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:31:03.8202860Z 2025-05-07T20:31:03.8202864Z 2025-05-07T20:31:03.8202869Z 2025-05-07T20:31:03.8202920Z 2025-05-07T20:31:03.8223104Z ################################################################################ 2025-05-07T20:31:03.8238587Z # [2025-05-07T20:31:03.823Z] Run Python Test Suite: 2025-05-07T20:31:03.8238935Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:03.8239228Z ################################################################################ 2025-05-07T20:31:03.8265336Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:03.8266230Z 2025-05-07T20:31:05.9603918Z ============================= test session starts ============================== 2025-05-07T20:31:05.9604581Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:05.9605107Z cachedir: .pytest_cache 2025-05-07T20:31:05.9605691Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:05.9606426Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:05.9606828Z plugins: hypothesis-6.131.14 2025-05-07T20:31:07.6726568Z collecting ... 
collected 5 items 2025-05-07T20:31:07.6726773Z 2025-05-07T20:31:07.6737130Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:07.6755944Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:07.6763315Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:07.6770241Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:07.6785553Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:07.6785902Z 2025-05-07T20:31:07.6786066Z =========================== short test summary info ============================ 2025-05-07T20:31:07.6786725Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.6787644Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.6788558Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.6789513Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.6790582Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.6791225Z ============================== 5 skipped in 1.84s ============================== 2025-05-07T20:31:08.2606149Z 2025-05-07T20:31:08.2606928Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:08.2627957Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds 2025-05-07T20:31:08.2628377Z 2025-05-07T20:31:08.2628381Z 2025-05-07T20:31:08.2628385Z 2025-05-07T20:31:08.2628388Z 2025-05-07T20:31:08.2648649Z ################################################################################ 2025-05-07T20:31:08.2663688Z # [2025-05-07T20:31:08.266Z] Run Python Test Suite: 2025-05-07T20:31:08.2664041Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:08.2664394Z ################################################################################ 2025-05-07T20:31:08.2689922Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:08.2690579Z 2025-05-07T20:31:10.4029627Z ============================= test session starts ============================== 2025-05-07T20:31:10.4030265Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:10.4030785Z cachedir: .pytest_cache 2025-05-07T20:31:10.4031353Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:10.4032078Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:10.4032496Z plugins: hypothesis-6.131.14 2025-05-07T20:31:12.2475752Z collecting ... 
collected 2 items 2025-05-07T20:31:12.2476185Z 2025-05-07T20:31:12.2484765Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:12.2499184Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:12.2499612Z 2025-05-07T20:31:12.2499771Z =========================== short test summary info ============================ 2025-05-07T20:31:12.2500385Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:12.2501210Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:12.2501809Z ============================== 2 skipped in 1.96s ============================== 2025-05-07T20:31:12.8490878Z 2025-05-07T20:31:12.8491548Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:12.8510718Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds 2025-05-07T20:31:12.8511044Z 2025-05-07T20:31:12.8511048Z 2025-05-07T20:31:12.8511061Z 2025-05-07T20:31:12.8511065Z 2025-05-07T20:31:12.8533614Z ################################################################################ 2025-05-07T20:31:12.8548446Z # [2025-05-07T20:31:12.854Z] Run Python Test Suite: 2025-05-07T20:31:12.8548774Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:12.8549110Z ################################################################################ 2025-05-07T20:31:12.8573600Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:12.8574221Z 2025-05-07T20:31:14.9901882Z ============================= test session starts ============================== 2025-05-07T20:31:14.9902514Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:14.9903358Z cachedir: .pytest_cache 2025-05-07T20:31:14.9903936Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:14.9904663Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:14.9905072Z plugins: hypothesis-6.131.14 2025-05-07T20:31:16.7573151Z collecting ... collected 4 items 2025-05-07T20:31:16.7573511Z 2025-05-07T20:31:19.2908124Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
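The SKIPPED entries in these suites come from guards evaluated when the test module loads: "Skip when CUDA is not available", "not enough GPUs; these tests require at least two GPUs", and "Skip when no Hopper GPU is available". A representative sketch of such guards; the conditions are paraphrased from the skip messages, not copied from the FBGEMM test sources:

import unittest
import torch

def hopper_available() -> bool:
    # Hopper is compute capability 9.x; a pre-Hopper part fails this check,
    # which matches the Hopper-only skips recorded above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability(0) >= (9, 0)

class ExampleGuardedTests(unittest.TestCase):
    @unittest.skipIf(not hopper_available(),
                     "Skip when no Hopper GPU is available. This test is only for Hopper GPU.")
    def test_hopper_only_kernel(self) -> None:
        ...

    @unittest.skipIf(torch.cuda.device_count() < 2,
                     "Skip when there are not enough GPUs; these tests require at least two GPUs")
    def test_multi_gpu_collective(self) -> None:
        ...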
2025-05-07T20:31:19.3031460Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:19.3178802Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:19.3303905Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:19.3304263Z 2025-05-07T20:31:19.3304414Z =========================== short test summary info ============================ 2025-05-07T20:31:19.3305142Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:19.3306285Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when xformers is not available 2025-05-07T20:31:19.3307033Z ============================== 4 skipped in 4.46s ============================== 2025-05-07T20:31:21.4329531Z 2025-05-07T20:31:21.4330521Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:21.4349942Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds 2025-05-07T20:31:21.4350350Z 2025-05-07T20:31:21.4350356Z 2025-05-07T20:31:21.4350362Z 2025-05-07T20:31:21.4350414Z 2025-05-07T20:31:21.4372571Z ################################################################################ 2025-05-07T20:31:21.4387814Z # [2025-05-07T20:31:21.438Z] Run Python Test Suite: 2025-05-07T20:31:21.4388317Z # ./moe/activation_test.py 2025-05-07T20:31:21.4388702Z ################################################################################ 2025-05-07T20:31:21.4412512Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:21.4413236Z 2025-05-07T20:31:23.5802156Z ============================= test session starts ============================== 2025-05-07T20:31:23.5802822Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:23.5803362Z cachedir: .pytest_cache 2025-05-07T20:31:23.5803935Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:23.5804669Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:23.5805084Z plugins: hypothesis-6.131.14 2025-05-07T20:31:25.2101139Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:25.3620934Z collecting ... 
collected 2 items

2025-05-07T20:31:30.5472053Z moe/activation_test.py::ActivationTests::test_silu_mul
Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:30.5553907Z PASSED
2025-05-07T20:31:30.6171278Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:30.6173400Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last):
2025-05-07T20:31:30.6176117Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:30.6178956Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:30.6180322Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:30.6181637Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:30.6183194Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:30.6184193Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:30.6185435Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:30.6186821Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:30.6187994Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:30.6189334Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:30.6190590Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     generator.visit(fn.parse())
2025-05-07T20:31:30.6191822Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:30.6193043Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ret = super().visit(node)
2025-05-07T20:31:30.6193878Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:30.6194919Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:30.6195950Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return visitor(node)
2025-05-07T20:31:30.6196751Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]            ^^^^^^^^^^^^^
2025-05-07T20:31:30.6197952Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:30.6199251Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:30.6200383Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:30.6201444Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     self.visit(item)
2025-05-07T20:31:30.6202624Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:30.6203985Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:30.6205055Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:30.6206067Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:30.6206809Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^
2025-05-07T20:31:30.6207843Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture.
[... identical identify_mutated_tensors warning traceback repeated 3 more times at W0507 20:31:30.632, 20:31:30.670, and 20:31:30.674; repetitions elided ...]
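Every warning above shares the root cause stated on its final line: Triton's fp8e4nv is the e4m3 format behind torch.float8_e4m3fn, and its NVIDIA code generation requires compute capability 8.9 or newer (Ada/Hopper); on older GPUs only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal guard sketch follows for skipping fp8 cases on such devices; device_supports_fp8e4nv and requires_fp8 are hypothetical names, not part of activation_test.py:

# Hypothetical helper, not from the FBGEMM test suite: detect whether the
# current CUDA device can compile Triton kernels that use fp8e4nv.
import unittest

import torch

def device_supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs an NVIDIA GPU with SM 8.9 or newer.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: decorate fp8 tests so unsupported GPUs skip instead of erroring.
requires_fp8 = unittest.skipUnless(
    device_supports_fp8e4nv(), "Triton fp8e4nv requires compute capability >= 8.9"
)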
2025-05-07T20:31:31.0850482Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:31:31.0851822Z     self=,
2025-05-07T20:31:31.0852633Z     T=1,
2025-05-07T20:31:31.0853018Z     D=5120,
2025-05-07T20:31:31.0853400Z     scale_ub=None,
2025-05-07T20:31:31.0853832Z     contiguous=True,
2025-05-07T20:31:31.0854281Z     compiled=True,
2025-05-07T20:31:31.0854683Z )
2025-05-07T20:31:31.0855325Z self = 
2025-05-07T20:31:31.0856320Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:31.0856830Z 
2025-05-07T20:31:31.0857005Z     @given(
2025-05-07T20:31:31.0857464Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:31.0858102Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:31.0858714Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:31.0859356Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:31.0859899Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:31.0860186Z     )
2025-05-07T20:31:31.0860531Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:31.0860974Z     def test_silu_mul_quant(
2025-05-07T20:31:31.0861223Z         self,
2025-05-07T20:31:31.0861418Z         T: int,
2025-05-07T20:31:31.0861624Z         D: int,
2025-05-07T20:31:31.0861849Z         scale_ub: Optional[float],
2025-05-07T20:31:31.0862120Z         contiguous: bool,
2025-05-07T20:31:31.0862362Z         compiled: bool,
2025-05-07T20:31:31.0862593Z     ) -> None:
2025-05-07T20:31:31.0862815Z         torch.manual_seed(2025)
2025-05-07T20:31:31.0863056Z 
2025-05-07T20:31:31.0863336Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:31.0864036Z 
2025-05-07T20:31:31.0864234Z         x_sign = torch.sign(x)
2025-05-07T20:31:31.0864533Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:31.0864849Z         x = x_sign * x_clamp
2025-05-07T20:31:31.0865086Z         x0 = x[:, :D]
2025-05-07T20:31:31.0865317Z         x1 = x[:, D:]
2025-05-07T20:31:31.0865526Z 
2025-05-07T20:31:31.0865711Z         if contiguous:
2025-05-07T20:31:31.0865952Z             x0 = x0.contiguous()
2025-05-07T20:31:31.0866214Z             x1 = x1.contiguous()
2025-05-07T20:31:31.0866453Z 
2025-05-07T20:31:31.0866660Z         if scale_ub is not None:
2025-05-07T20:31:31.0866935Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:31.0867434Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:31.0867753Z             )
2025-05-07T20:31:31.0867955Z         else:
2025-05-07T20:31:31.0868173Z             scale_ub_tensor = None
2025-05-07T20:31:31.0868423Z 
2025-05-07T20:31:31.0868665Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:31.0868984Z             op = silu_mul_quant
2025-05-07T20:31:31.0869294Z             if compiled:
2025-05-07T20:31:31.0869546Z                 op = torch.compile(op)
2025-05-07T20:31:31.0869848Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:31.0870124Z 
2025-05-07T20:31:31.0870324Z         y_fp8, y_scale = fn()
2025-05-07T20:31:31.0870613Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:31.0870901Z 
2025-05-07T20:31:31.0871150Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:31.0871485Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:31.0871776Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:31.0872099Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:31.0872467Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:31.0872778Z 
2025-05-07T20:31:31.0872984Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:31.0873187Z 
2025-05-07T20:31:31.0873289Z moe/activation_test.py:126: 
2025-05-07T20:31:31.0873593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:31.0873940Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:31.0874273Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:31.0875068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:31.0875817Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:31.0876364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:31.0877060Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:31.0877743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:31.0878477Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:31.0879241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:31.0879993Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:31.0880718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:31.0881363Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:31.0881971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:31.0882492Z     fn()
2025-05-07T20:31:31.0882996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:31.0883583Z     self.fn.run(
2025-05-07T20:31:31.0884145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:31.0884671Z     kernel = self.compile(
2025-05-07T20:31:31.0885211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:31.0885864Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:31.0886265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:31.0886493Z 
2025-05-07T20:31:31.0886701Z self = 
2025-05-07T20:31:31.0887787Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:31.0889286Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a9943240>}
2025-05-07T20:31:31.0890672Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:31.0891687Z context = 
2025-05-07T20:31:31.0891975Z 
2025-05-07T20:31:31.0892142Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:31.0892668Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:31.0893136Z                            module_map=module_map)
2025-05-07T20:31:31.0893509Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:31.0893866Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:31.0894137Z E       ^
2025-05-07T20:31:31.0894601Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:31.0895054Z 
2025-05-07T20:31:31.0895470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:31.0895992Z 
2025-05-07T20:31:31.0896099Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:31.0896519Z     self=,
2025-05-07T20:31:31.0896915Z     T=2048,
2025-05-07T20:31:31.0897112Z     D=5120,
2025-05-07T20:31:31.0897311Z     scale_ub=1200.0,
2025-05-07T20:31:31.0897533Z     contiguous=True,
2025-05-07T20:31:31.0897758Z     compiled=False,
2025-05-07T20:31:31.0897976Z )
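Hypothesis prints the sampled arguments for each example, so the failing case above can be replayed outside the test harness. A standalone sketch under stated assumptions (the import path is taken from the traceback's installed fbgemm_gpu build; on a GPU without fp8e4nv support this raises the same CompilationError):

# Standalone repro sketch for the example above (T=2048, D=5120,
# scale_ub=1200.0, contiguous=True, compiled=False).
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 2048, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0 = x[:, :D].contiguous()  # contiguous=True branch of the test
x1 = x[:, D:].contiguous()
scale_ub_tensor = torch.tensor([1200.0], device="cuda", dtype=torch.float32)
# Raises triton.compiler.errors.CompilationError on GPUs that lack fp8e4nv.
y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub_tensor)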
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:31.4412202Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:31.4413047Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:31.4414075Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:31.4415097Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:31.4415898Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:31.4417115Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:31.4418421Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:31.4419541Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:31.4420579Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:31.4421764Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:31.4423130Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:31.4424284Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:31.4425202Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:31.4425943Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:31.4426963Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
[... identical identify_mutated_tensors warning traceback repeated 3 more times at W0507 20:31:31.533, 20:31:31.798, and 20:31:31.813; repetitions elided ...]
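The reference path that keeps failing above (ref_fn) computes silu(x0) * x1 in fp32 and then quantizes row-wise via triton_quantize_fp8_row, which cannot compile here either. For intuition only, a pure-PyTorch stand-in sketch; quantize_fp8_row_ref is a hypothetical name, the exact kernel semantics (including how scale_ub is applied) are assumptions, and dequantization follows the test's y_fp8.to(torch.float32) * y_scale[:, None] convention:

# Hypothetical pure-PyTorch stand-in for row-wise fp8 quantization; the real
# _kernel_quantize_fp8_row's exact semantics are assumed, not copied.
from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        # Assumption: scale_ub caps the per-row maximum before scaling.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    # One scale per row; avoid dividing by zero for all-zero rows.
    scale = torch.where(row_max > 0, row_max / FP8_E4M3_MAX, torch.ones_like(row_max))
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale  # dequantize as y_fp8.to(torch.float32) * scale[:, None]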
2025-05-07T20:31:32.1375374Z self = 
2025-05-07T20:31:32.1376318Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:31:32.1376836Z 
2025-05-07T20:31:32.1376989Z     @given(
2025-05-07T20:31:32.1377422Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:32.1378002Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:32.1378555Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:32.1379171Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:32.1379775Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:32.1380331Z     )
2025-05-07T20:31:32.1380995Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:32.1381532Z     def test_silu_mul_quant(
2025-05-07T20:31:32.1381775Z         self,
2025-05-07T20:31:32.1381987Z         T: int,
2025-05-07T20:31:32.1382196Z         D: int,
2025-05-07T20:31:32.1382427Z         scale_ub: Optional[float],
2025-05-07T20:31:32.1382699Z         contiguous: bool,
2025-05-07T20:31:32.1382942Z         compiled: bool,
2025-05-07T20:31:32.1383175Z     ) -> None:
2025-05-07T20:31:32.1383390Z         torch.manual_seed(2025)
2025-05-07T20:31:32.1383644Z 
2025-05-07T20:31:32.1383923Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:32.1384264Z 
2025-05-07T20:31:32.1384462Z         x_sign = torch.sign(x)
2025-05-07T20:31:32.1384764Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:32.1385074Z         x = x_sign * x_clamp
2025-05-07T20:31:32.1385326Z         x0 = x[:, :D]
2025-05-07T20:31:32.1385540Z         x1 = x[:, D:]
2025-05-07T20:31:32.1385908Z 
2025-05-07T20:31:32.1386104Z         if contiguous:
2025-05-07T20:31:32.1386333Z             x0 = x0.contiguous()
2025-05-07T20:31:32.1386594Z             x1 = x1.contiguous()
2025-05-07T20:31:32.1386839Z 
2025-05-07T20:31:32.1387026Z         if scale_ub is not None:
2025-05-07T20:31:32.1387300Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:32.1387754Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:32.1388188Z             )
2025-05-07T20:31:32.1388459Z         else:
2025-05-07T20:31:32.1388747Z             scale_ub_tensor = None
2025-05-07T20:31:32.1389190Z 
2025-05-07T20:31:32.1389500Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:32.1390086Z             op = silu_mul_quant
2025-05-07T20:31:32.1390382Z             if compiled:
2025-05-07T20:31:32.1390635Z                 op = torch.compile(op)
2025-05-07T20:31:32.1390934Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:32.1391218Z 
2025-05-07T20:31:32.1391414Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:32.1391589Z 
2025-05-07T20:31:32.1391691Z moe/activation_test.py:117: 
2025-05-07T20:31:32.1391990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:32.1392318Z moe/activation_test.py:115: in fn
2025-05-07T20:31:32.1392606Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:32.1393302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:32.1393982Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:32.1394520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:32.1395211Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:32.1395871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:32.1396442Z     kernel = self.compile(
2025-05-07T20:31:32.1396991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:32.1397643Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:32.1398049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:32.1398277Z 
2025-05-07T20:31:32.1398493Z self = 
2025-05-07T20:31:32.1399576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:32.1400938Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a995ade0>}
2025-05-07T20:31:32.1402286Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:32.1403315Z context = 
2025-05-07T20:31:32.1403603Z 
2025-05-07T20:31:32.1403780Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:32.1404295Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:32.1404767Z                            module_map=module_map)
2025-05-07T20:31:32.1405136Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:32.1405499Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:32.1405762Z E       ^
2025-05-07T20:31:32.1406231Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:32.1406682Z 
2025-05-07T20:31:32.1407199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:32.1407710Z 
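The sampled contiguous parameter decides whether the x0/x1 halves stay strided views of x or are materialized, which is why both layouts are tested. A small illustration, not from the test suite and needing no GPU:

# Why contiguous=False exercises strided inputs: column slices of a [T, 2*D]
# tensor keep the parent's row stride of 2*D.
import torch

T, D = 4, 8
x = torch.randn([T, 2 * D])
x0, x1 = x[:, :D], x[:, D:]
print(x0.is_contiguous(), x0.stride())   # False (16, 1): row stride is 2*D
print(x0.contiguous().is_contiguous())   # True: .contiguous() copies to a dense buffer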
2025-05-07T20:31:32.1407828Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:32.1408246Z     self=,
2025-05-07T20:31:32.1408659Z     T=2048,
2025-05-07T20:31:32.1408865Z     D=5120,
2025-05-07T20:31:32.1409060Z     scale_ub=1200.0,
2025-05-07T20:31:32.1409292Z     contiguous=True,
2025-05-07T20:31:32.1409518Z     compiled=True,
2025-05-07T20:31:32.1409722Z )
2025-05-07T20:31:32.1410094Z self = 
2025-05-07T20:31:32.1410803Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:32.1411140Z 
2025-05-07T20:31:32.1411238Z     @given(
2025-05-07T20:31:32.1411533Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:32.1411937Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:32.1412327Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:32.1412670Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:32.1413003Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:32.1413293Z     )
2025-05-07T20:31:32.1413642Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:32.1414085Z     def test_silu_mul_quant(
2025-05-07T20:31:32.1414330Z         self,
2025-05-07T20:31:32.1414523Z         T: int,
2025-05-07T20:31:32.1414725Z         D: int,
2025-05-07T20:31:32.1414949Z         scale_ub: Optional[float],
2025-05-07T20:31:32.1415217Z         contiguous: bool,
2025-05-07T20:31:32.1415468Z         compiled: bool,
2025-05-07T20:31:32.1415694Z     ) -> None:
2025-05-07T20:31:32.1415911Z         torch.manual_seed(2025)
2025-05-07T20:31:32.1416157Z 
2025-05-07T20:31:32.1416432Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:32.1416775Z 
2025-05-07T20:31:32.1416981Z         x_sign = torch.sign(x)
2025-05-07T20:31:32.1417282Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:32.1417595Z         x = x_sign * x_clamp
2025-05-07T20:31:32.1417834Z         x0 = x[:, :D]
2025-05-07T20:31:32.1418055Z         x1 = x[:, D:]
2025-05-07T20:31:32.1418269Z 
2025-05-07T20:31:32.1418456Z         if contiguous:
2025-05-07T20:31:32.1418692Z             x0 = x0.contiguous()
2025-05-07T20:31:32.1418960Z             x1 = x1.contiguous()
2025-05-07T20:31:32.1419202Z 
2025-05-07T20:31:32.1419404Z         if scale_ub is not None:
2025-05-07T20:31:32.1419687Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:32.1420028Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:32.1420353Z             )
2025-05-07T20:31:32.1420558Z         else:
2025-05-07T20:31:32.1420790Z             scale_ub_tensor = None
2025-05-07T20:31:32.1421077Z 
2025-05-07T20:31:32.1421326Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:32.1421642Z             op = silu_mul_quant
2025-05-07T20:31:32.1421906Z             if compiled:
2025-05-07T20:31:32.1422166Z                 op = torch.compile(op)
2025-05-07T20:31:32.1422468Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:32.1422740Z 
2025-05-07T20:31:32.1422937Z         y_fp8, y_scale = fn()
2025-05-07T20:31:32.1423229Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:32.1423516Z 
2025-05-07T20:31:32.1423762Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:32.1424096Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:32.1424387Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:32.1424709Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:32.1425074Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:32.1425384Z 
2025-05-07T20:31:32.1425681Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:32.1425883Z 
2025-05-07T20:31:32.1425984Z moe/activation_test.py:126: 
2025-05-07T20:31:32.1426290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:32.1426623Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:32.1426950Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:32.1427734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:32.1428905Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:32.1429515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:32.1430346Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:32.1431029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:32.1431746Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:32.1432498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:32.1433245Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:32.1433974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:32.1434602Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:32.1435202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:32.1435729Z     fn()
2025-05-07T20:31:32.1436234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:32.1436820Z     self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.1437832Z kernel = self.compile( 2025-05-07T20:31:32.1438374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.1439031Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.1439437Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.1439663Z 2025-05-07T20:31:32.1439874Z self = 2025-05-07T20:31:32.1440995Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.1442377Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a9ace700>} 2025-05-07T20:31:32.1443717Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.1444740Z context = 2025-05-07T20:31:32.1445025Z 2025-05-07T20:31:32.1445192Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.1445707Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.1446173Z module_map=module_map) 2025-05-07T20:31:32.1446543Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.1446896Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:32.1447164Z E ^ 2025-05-07T20:31:32.1447772Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.1448223Z 2025-05-07T20:31:32.1448642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.1449153Z 2025-05-07T20:31:32.1449256Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.1449665Z self=, 2025-05-07T20:31:32.1450123Z T=16384, 2025-05-07T20:31:32.1450365Z D=7168, 2025-05-07T20:31:32.1450609Z scale_ub=1200.0, 2025-05-07T20:31:32.1450889Z contiguous=False, 2025-05-07T20:31:32.1451165Z compiled=False, 2025-05-07T20:31:32.1451419Z ) 2025-05-07T20:31:32.3947194Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.3948281Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:32.3949671Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.3951084Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.3952064Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.3953372Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 
2025-05-07T20:31:32.3954747Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.3955725Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.3956947Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.3958326Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.3959396Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.3960674Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.3961926Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:32.3963147Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.3964356Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:32.3965453Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.3966487Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.3967520Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:32.3968318Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.3969538Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.3970957Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.3972088Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.3973138Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:32.3974317Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.3975688Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.3976755Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.3977670Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.3978413Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:32.3979438Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.6128411Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.6129504Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:32.6130862Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.6132284Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.6133258Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6134564Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:32.6136100Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.6137093Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6138327Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.6139700Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.6140888Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6142168Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.6143422Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:32.6144647Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.6145865Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:32.6146701Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6147730Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.6148757Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:32.6149609Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.6150816Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.6152110Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.6153239Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.6154293Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:32.6155481Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.6156857Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.6157938Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.6158945Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.6159699Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:32.6160724Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.8540036Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.8542132Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:32.8544803Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.8547629Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.8549670Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8551333Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:32.8552716Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.8553703Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8554927Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.8556295Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.8557353Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8558640Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.8559886Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:32.8561131Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.8562347Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:32.8563192Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8564371Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.8565407Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:32.8566211Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.8567437Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.8568739Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.8569961Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.8571013Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:32.8572188Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.8573553Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.8574617Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.8575538Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.8583432Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:32.8584492Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.8681171Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.8682229Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:32.8683585Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.8685006Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.8685973Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8687275Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:32.8688658Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.8689801Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8691045Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.8692413Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.8693476Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8694874Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.8696121Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:32.8697358Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.8698570Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:32.8699404Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8700435Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.8701466Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:32.8702267Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.8703473Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.8704756Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.8705883Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.8706935Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:32.8708128Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.8709540Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.8710607Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.8711574Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.8712408Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:32.8713425Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:33.6605155Z self = 2025-05-07T20:31:33.6605905Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:33.6606305Z 2025-05-07T20:31:33.6606411Z @given( 2025-05-07T20:31:33.6606716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:33.6607134Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:33.6607523Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:33.6608297Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:33.6608631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:33.6608918Z ) 2025-05-07T20:31:33.6609280Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:33.6609729Z def test_silu_mul_quant( 2025-05-07T20:31:33.6609980Z self, 2025-05-07T20:31:33.6610185Z T: int, 2025-05-07T20:31:33.6610393Z D: int, 2025-05-07T20:31:33.6610628Z scale_ub: Optional[float], 2025-05-07T20:31:33.6610897Z contiguous: bool, 2025-05-07T20:31:33.6611148Z compiled: bool, 2025-05-07T20:31:33.6611382Z ) -> None: 2025-05-07T20:31:33.6611604Z torch.manual_seed(2025) 2025-05-07T20:31:33.6611854Z 2025-05-07T20:31:33.6612138Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:33.6612485Z 2025-05-07T20:31:33.6612691Z x_sign = torch.sign(x) 2025-05-07T20:31:33.6612999Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:33.6613312Z x = x_sign * x_clamp 2025-05-07T20:31:33.6613565Z x0 = x[:, :D] 2025-05-07T20:31:33.6613791Z x1 = x[:, D:] 2025-05-07T20:31:33.6614001Z 2025-05-07T20:31:33.6614198Z if contiguous: 2025-05-07T20:31:33.6614439Z x0 = x0.contiguous() 2025-05-07T20:31:33.6614698Z x1 = x1.contiguous() 2025-05-07T20:31:33.6614948Z 2025-05-07T20:31:33.6615147Z if scale_ub is not None: 2025-05-07T20:31:33.6615420Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:33.6615760Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:33.6616074Z ) 2025-05-07T20:31:33.6616274Z else: 2025-05-07T20:31:33.6616486Z scale_ub_tensor = None 2025-05-07T20:31:33.6616742Z 2025-05-07T20:31:33.6616983Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:33.6617305Z op = silu_mul_quant 2025-05-07T20:31:33.6617564Z if compiled: 2025-05-07T20:31:33.6617817Z op = torch.compile(op) 2025-05-07T20:31:33.6618115Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:33.6618397Z 2025-05-07T20:31:33.6618607Z > y_fp8, y_scale = fn() 2025-05-07T20:31:33.6618772Z 2025-05-07T20:31:33.6618883Z moe/activation_test.py:117: 2025-05-07T20:31:33.6619182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:33.6619519Z moe/activation_test.py:115: in fn 2025-05-07T20:31:33.6619802Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:33.6620489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:33.6621180Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:33.6621719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:33.6622405Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:33.6623068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:33.6623761Z kernel = self.compile( 2025-05-07T20:31:33.6624307Z
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:33.6624962Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:33.6625367Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:33.6625597Z 2025-05-07T20:31:33.6625811Z self = 2025-05-07T20:31:33.6626890Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:33.6628536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68aabc1760>} 2025-05-07T20:31:33.6629948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:33.6631014Z context = 2025-05-07T20:31:33.6631313Z 2025-05-07T20:31:33.6631492Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:33.6632006Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:33.6632478Z module_map=module_map) 2025-05-07T20:31:33.6632861Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:33.6633228Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:33.6633489Z E ^ 2025-05-07T20:31:33.6633964Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:33.6634411Z 2025-05-07T20:31:33.6634844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:33.6635355Z 2025-05-07T20:31:33.6635469Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:33.6635882Z self=, 2025-05-07T20:31:33.6636288Z T=1, 2025-05-07T20:31:33.6636483Z D=7168, 2025-05-07T20:31:33.6636680Z scale_ub=None, 2025-05-07T20:31:33.6636904Z contiguous=True, 2025-05-07T20:31:33.6637136Z compiled=True, 2025-05-07T20:31:33.6637348Z ) 2025-05-07T20:31:33.6637678Z self = 2025-05-07T20:31:33.6638177Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:33.6638437Z 2025-05-07T20:31:33.6638525Z @given( 2025-05-07T20:31:33.6638757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:33.6639086Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:33.6639401Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:33.6639728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:33.6640067Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:33.6640357Z ) 2025-05-07T20:31:33.6640703Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:33.6641148Z def test_silu_mul_quant( 2025-05-07T20:31:33.6641397Z self, 2025-05-07T20:31:33.6641595Z T: int, 2025-05-07T20:31:33.6641793Z D: int, 2025-05-07T20:31:33.6642015Z scale_ub: Optional[float], 2025-05-07T20:31:33.6642294Z contiguous: bool, 2025-05-07T20:31:33.6642544Z compiled: bool, 2025-05-07T20:31:33.6642769Z ) -> None: 2025-05-07T20:31:33.6642991Z torch.manual_seed(2025) 2025-05-07T20:31:33.6643230Z 2025-05-07T20:31:33.6643510Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:33.6643978Z 2025-05-07T20:31:33.6644175Z x_sign = torch.sign(x) 2025-05-07T20:31:33.6644474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:33.6644785Z x = x_sign * x_clamp 2025-05-07T20:31:33.6645025Z x0 = x[:, :D] 2025-05-07T20:31:33.6645244Z x1 = 
x[:, D:] 2025-05-07T20:31:33.6645454Z 2025-05-07T20:31:33.6645638Z if contiguous: 2025-05-07T20:31:33.6645879Z x0 = x0.contiguous() 2025-05-07T20:31:33.6646142Z x1 = x1.contiguous() 2025-05-07T20:31:33.6646374Z 2025-05-07T20:31:33.6646570Z if scale_ub is not None: 2025-05-07T20:31:33.6646848Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:33.6647187Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:33.6647611Z ) 2025-05-07T20:31:33.6647807Z else: 2025-05-07T20:31:33.6648025Z scale_ub_tensor = None 2025-05-07T20:31:33.6648274Z 2025-05-07T20:31:33.6648516Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:33.6648832Z op = silu_mul_quant 2025-05-07T20:31:33.6649087Z if compiled: 2025-05-07T20:31:33.6649337Z op = torch.compile(op) 2025-05-07T20:31:33.6649636Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:33.6649914Z 2025-05-07T20:31:33.6650111Z y_fp8, y_scale = fn() 2025-05-07T20:31:33.6650400Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:33.6650690Z 2025-05-07T20:31:33.6650973Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:33.6651332Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:33.6651631Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:33.6651947Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:33.6652315Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:33.6652633Z 2025-05-07T20:31:33.6652837Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:33.6653046Z 2025-05-07T20:31:33.6653148Z moe/activation_test.py:126: 2025-05-07T20:31:33.6653454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:33.6653792Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:33.6654127Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:33.6654914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:33.6655673Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:33.6656217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:33.6656908Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:33.6657605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:33.6658340Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:33.6659096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp> 2025-05-07T20:31:33.6659865Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:33.6660598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:33.6661288Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:33.6661891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:33.6662417Z fn() 2025-05-07T20:31:33.6662925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:33.6663504Z self.fn.run( 2025-05-07T20:31:33.6664061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:33.6664597Z kernel = self.compile( 2025-05-07T20:31:33.6665137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:33.6665793Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:33.6666200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:33.6666427Z 2025-05-07T20:31:33.6666644Z self = 2025-05-07T20:31:33.6667716Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:33.6669228Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68aa04cd60>} 2025-05-07T20:31:33.6670577Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:33.6671600Z context = 2025-05-07T20:31:33.6671887Z 2025-05-07T20:31:33.6672062Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:33.6672574Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:33.6673043Z module_map=module_map) 2025-05-07T20:31:33.6673415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:33.6673765Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:33.6674034Z E ^ 2025-05-07T20:31:33.6674511Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:33.6674955Z 2025-05-07T20:31:33.6675380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:33.6675887Z 2025-05-07T20:31:33.6675991Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:33.6676404Z self=, 2025-05-07T20:31:33.6676811Z T=4096, 2025-05-07T20:31:33.6676998Z D=5120, 2025-05-07T20:31:33.6677196Z scale_ub=None, 2025-05-07T20:31:33.6677419Z contiguous=False, 2025-05-07T20:31:33.6677641Z compiled=False, 2025-05-07T20:31:33.6677856Z ) 2025-05-07T20:31:34.0225762Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:34.0228051Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:34.0230856Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:34.0232514Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:34.0233483Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.0234962Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:34.0236341Z W0507 20:31:34.020000 237772 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:34.0237325Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.0238545Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:34.0239908Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:34.0241092Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.0242367Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:34.0243610Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:34.0244831Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:34.0246038Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:34.0246871Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.0247895Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:34.0248915Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:34.0249704Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:34.0250914Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:34.0252260Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:34.0253382Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:34.0254427Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:34.0255601Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:34.0256968Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:34.0258115Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:34.0259030Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:34.0259775Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:34.0260787Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:34.2744085Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:34.2745511Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:34.2746846Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:34.2748305Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:34.2749369Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.2750666Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:34.2752093Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:34.2753073Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.2754300Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:34.2755825Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:34.2756890Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.2758299Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:34.2759550Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:34.2760768Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:34.2761979Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:34.2762813Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.2763983Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:34.2765003Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:34.2765804Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:34.2767008Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:34.2768368Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:34.2769502Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:34.2770541Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:34.2771721Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:34.2773073Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:34.2774137Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:34.2775053Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:34.2775796Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:34.2776818Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:34.6422647Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:34.6423714Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:34.6425057Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:34.6426467Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:34.6427443Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.6428888Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:34.6430312Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:34.6431457Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.6432685Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:34.6434050Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:34.6435122Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.6436544Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:34.6437792Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:34.6439017Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:34.6440221Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:34.6441063Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.6442095Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:34.6443117Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:34.6443907Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:34.6445121Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:34.6446400Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:34.6447531Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:34.6448578Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:34.6449745Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:34.6451105Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:34.6452163Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:34.6453083Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:34.6453909Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:34.6454927Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:34.6565532Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:34.6567760Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last):
2025-05-07T20:31:34.6570706Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:34.6572488Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:34.6573457Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.6574759Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:34.6576138Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:34.6577126Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.6578347Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:34.6579704Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:34.6580766Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.6582057Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:34.6583301Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     generator.visit(fn.parse())
2025-05-07T20:31:34.6584522Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:34.6585730Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ret = super().visit(node)
2025-05-07T20:31:34.6586556Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.6587588Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:34.6588720Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return visitor(node)
2025-05-07T20:31:34.6589586Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^
2025-05-07T20:31:34.6590790Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:34.6592121Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:34.6593319Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:34.6594372Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     self.visit(item)
2025-05-07T20:31:34.6595556Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:34.6596905Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:34.6597967Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.6598881Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:34.6599632Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^
2025-05-07T20:31:34.6600647Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.4001400Z self =
2025-05-07T20:31:36.4001979Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:36.4002363Z
2025-05-07T20:31:36.4002486Z     @given(
2025-05-07T20:31:36.4002811Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:36.4003251Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:36.4003654Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:36.4004017Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:36.4004359Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:36.4004646Z     )
2025-05-07T20:31:36.4005009Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:36.4005455Z     def test_silu_mul_quant(
2025-05-07T20:31:36.4005712Z         self,
2025-05-07T20:31:36.4005929Z         T: int,
2025-05-07T20:31:36.4006124Z         D: int,
2025-05-07T20:31:36.4006346Z         scale_ub: Optional[float],
2025-05-07T20:31:36.4006627Z         contiguous: bool,
2025-05-07T20:31:36.4006865Z         compiled: bool,
2025-05-07T20:31:36.4007103Z     ) -> None:
2025-05-07T20:31:36.4007330Z         torch.manual_seed(2025)
2025-05-07T20:31:36.4007582Z
2025-05-07T20:31:36.4007869Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:36.4016631Z
2025-05-07T20:31:36.4016869Z         x_sign = torch.sign(x)
2025-05-07T20:31:36.4017179Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:36.4017507Z         x = x_sign * x_clamp
2025-05-07T20:31:36.4017786Z         x0 = x[:, :D]
2025-05-07T20:31:36.4018011Z         x1 = x[:, D:]
2025-05-07T20:31:36.4018230Z
2025-05-07T20:31:36.4018750Z         if contiguous:
2025-05-07T20:31:36.4018996Z             x0 = x0.contiguous()
2025-05-07T20:31:36.4019259Z             x1 = x1.contiguous()
2025-05-07T20:31:36.4019502Z
2025-05-07T20:31:36.4019703Z         if scale_ub is not None:
2025-05-07T20:31:36.4019981Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:36.4020323Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:36.4020638Z             )
2025-05-07T20:31:36.4020838Z         else:
2025-05-07T20:31:36.4021049Z             scale_ub_tensor = None
2025-05-07T20:31:36.4021308Z
2025-05-07T20:31:36.4021552Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:36.4021876Z             op = silu_mul_quant
2025-05-07T20:31:36.4022337Z             if compiled:
2025-05-07T20:31:36.4022591Z                 op = torch.compile(op)
2025-05-07T20:31:36.4022891Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:36.4023161Z
2025-05-07T20:31:36.4023366Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:36.4023531Z
2025-05-07T20:31:36.4023646Z moe/activation_test.py:117:
2025-05-07T20:31:36.4023937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.4024274Z moe/activation_test.py:115: in fn
2025-05-07T20:31:36.4024559Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:36.4025246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:36.4025947Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:36.4026488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:36.4027186Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:36.4027845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:36.4028800Z     kernel = self.compile(
2025-05-07T20:31:36.4029416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:36.4030077Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.4030469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.4030702Z
2025-05-07T20:31:36.4030908Z self =
2025-05-07T20:31:36.4031985Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:36.4033453Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a8d402c0>}
2025-05-07T20:31:36.4034804Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:36.4035822Z context =
2025-05-07T20:31:36.4036112Z
2025-05-07T20:31:36.4036278Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:36.4036794Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.4037261Z                            module_map=module_map)
2025-05-07T20:31:36.4037623Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.4037982Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.4038244Z E       ^
2025-05-07T20:31:36.4038706Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.4039165Z
2025-05-07T20:31:36.4039756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
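Every Hypothesis example below dies in the same fp8e4nv compile, so the failure is a property of the runner's GPU, not of the sampled shapes. On pre-SM-8.9 machines a guard along the following lines would skip the test up front (a sketch only; requires_fp8e4nv is a hypothetical marker, not an existing FBGEMM decorator):

    import pytest
    import torch

    # Hypothetical gate for tests that JIT-compile fp8e4nv (FP8 E4M3) Triton kernels.
    requires_fp8e4nv = pytest.mark.skipif(
        not torch.cuda.is_available()
        or torch.cuda.get_device_capability() < (8, 9),
        reason="Triton fp8e4nv needs SM 8.9+ (L4/L40S/H100); A10G is SM 8.6",
    )

    @requires_fp8e4nv
    def test_fp8_smoke() -> None:
        # Placeholder body; the real test would exercise the fp8 kernels.
        assert torch.cuda.get_device_capability() >= (8, 9)

Applied above the existing @given/@settings decorators, this turns the hard CompilationError into a skip on unsupported hardware.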
2025-05-07T20:31:36.4040381Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.4040796Z     self=,
2025-05-07T20:31:36.4041193Z     T=4096,
2025-05-07T20:31:36.4041388Z     D=7168,
2025-05-07T20:31:36.4041590Z     scale_ub=None,
2025-05-07T20:31:36.4041813Z     contiguous=False,
2025-05-07T20:31:36.4042047Z     compiled=False,
2025-05-07T20:31:36.4042262Z )
2025-05-07T20:31:36.4068851Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.4069324Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.4069585Z E       ^
2025-05-07T20:31:36.4070044Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.4070930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:36.4071555Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.4071968Z     self=,
2025-05-07T20:31:36.4072365Z     T=128,
2025-05-07T20:31:36.4072566Z     D=7168,
2025-05-07T20:31:36.4072768Z     scale_ub=None,
2025-05-07T20:31:36.4072981Z     contiguous=False,
2025-05-07T20:31:36.4073207Z     compiled=True,
2025-05-07T20:31:36.4073410Z )
2025-05-07T20:31:36.4546679Z         y_fp8, y_scale = fn()
2025-05-07T20:31:36.4546972Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:36.4547268Z
2025-05-07T20:31:36.4547505Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:36.4547844Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:36.4548144Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:36.4548458Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:36.4548820Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:36.4549256Z
2025-05-07T20:31:36.4549462Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:36.4549766Z moe/activation_test.py:126:
2025-05-07T20:31:36.4550066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.4550407Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:36.4550733Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:36.4551526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:36.4552288Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:36.4552837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:36.4553520Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:36.4554212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:36.4554936Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:36.4555685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:36.4556431Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:36.4557156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:36.4557801Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:36.4558492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:36.4559019Z     fn()
2025-05-07T20:31:36.4559529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:36.4560106Z     self.fn.run(
2025-05-07T20:31:36.4560573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:36.4561104Z     kernel = self.compile(
2025-05-07T20:31:36.4561650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:36.4562357Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.4562834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.4563065Z
2025-05-07T20:31:36.4563281Z self =
2025-05-07T20:31:36.4564364Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:36.4565754Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a8d437e0>}
2025-05-07T20:31:36.4567092Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:36.4568111Z context =
2025-05-07T20:31:36.4568402Z
2025-05-07T20:31:36.4568578Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:36.4569095Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.4569565Z                            module_map=module_map)
2025-05-07T20:31:36.4569973Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.4570441Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:36.4570708Z E       ^
2025-05-07T20:31:36.4571181Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.4571628Z
2025-05-07T20:31:36.4572087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
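Note that with compiled=True the test survives fn() but still dies in ref_fn(): the reference path calls triton_quantize_fp8_row, which JIT-compiles its own fp8 Triton kernel (_kernel_quantize_fp8_row in fp8_gemm.py), so it hits the same architecture limit. A capability-independent reference would have to avoid Triton entirely, e.g. a pure-PyTorch row-wise quantizer like the following (a sketch; the scale semantics are approximated from how the test consumes the outputs, and it assumes torch.float8_e4m3fn conversions are available in this build):

    import torch

    FP8_DTYPE = torch.float8_e4m3fn
    FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for E4M3

    def quantize_fp8_row_torch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # Per-row scale chosen so that dequantization is y_fp8.float() * scale[:, None],
        # mirroring the test's use of triton_quantize_fp8_row's outputs.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale[:, None]).to(FP8_DTYPE)
        return y_fp8, scale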
2025-05-07T20:31:36.4572733Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.4573143Z     self=,
2025-05-07T20:31:36.4573555Z     T=128,
2025-05-07T20:31:36.4573753Z     D=7168,
2025-05-07T20:31:36.4573944Z     scale_ub=None,
2025-05-07T20:31:36.4574166Z     contiguous=False,
2025-05-07T20:31:36.4574408Z     compiled=False,
2025-05-07T20:31:36.4574617Z )
2025-05-07T20:31:36.6125496Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.6125853Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.6126109Z E       ^
2025-05-07T20:31:36.6126584Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.6127533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:36.6128555Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.6129031Z     self=,
2025-05-07T20:31:36.6129502Z     T=4096,
2025-05-07T20:31:36.6129708Z     D=5120,
2025-05-07T20:31:36.6129915Z     scale_ub=1200.0,
2025-05-07T20:31:36.6130157Z     contiguous=True,
2025-05-07T20:31:36.6130400Z     compiled=False,
2025-05-07T20:31:36.6130618Z )
2025-05-07T20:31:36.6167981Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.6168351Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.6168623Z E       ^
2025-05-07T20:31:36.6169106Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.6169998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:36.6170629Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.6171047Z     self=,
2025-05-07T20:31:36.6171461Z     T=1,
2025-05-07T20:31:36.6171664Z     D=5120,
2025-05-07T20:31:36.6171868Z     scale_ub=None,
2025-05-07T20:31:36.6172086Z     contiguous=True,
2025-05-07T20:31:36.6172341Z     compiled=True,
2025-05-07T20:31:36.6172583Z )
2025-05-07T20:31:37.5167943Z self =
2025-05-07T20:31:37.5168656Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:37.5183709Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:37.5184016Z moe/activation_test.py:126:
2025-05-07T20:31:37.5204307Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:37.5204662Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:37.5204930Z E       ^
2025-05-07T20:31:37.5205389Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.5206263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.5205846Z 
2025-05-07T20:31:37.5206263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:37.5206776Z 
2025-05-07T20:31:37.5206880Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:37.5207303Z self=,
2025-05-07T20:31:37.5207700Z T=2048,
2025-05-07T20:31:37.5207894Z D=5120,
2025-05-07T20:31:37.5208093Z scale_ub=None,
2025-05-07T20:31:37.5208312Z contiguous=True,
2025-05-07T20:31:37.5208544Z compiled=True,
2025-05-07T20:31:37.5208757Z )
2025-05-07T20:31:38.2131075Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:38.2132348Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last):
2025-05-07T20:31:38.2133681Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:38.2135090Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:38.2136078Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:38.2137370Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:38.2138748Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:38.2139728Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:38.2140959Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:38.2142332Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:38.2143438Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:38.2144714Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:38.2145966Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     generator.visit(fn.parse())
2025-05-07T20:31:38.2147369Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:38.2148710Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ret = super().visit(node)
2025-05-07T20:31:38.2158462Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:38.2159562Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:38.2160607Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     return visitor(node)
2025-05-07T20:31:38.2161602Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ^^^^^^^^^^^^^
2025-05-07T20:31:38.2162894Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:38.2164201Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:38.2165330Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:38.2166396Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     self.visit(item)
2025-05-07T20:31:38.2167601Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:38.2168984Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:38.2170061Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:38.2170978Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:38.2171737Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^
2025-05-07T20:31:38.2172772Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.5756753Z self = 2025-05-07T20:31:38.5757554Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:38.5757924Z 2025-05-07T20:31:38.5758016Z @given( 2025-05-07T20:31:38.5758257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:38.5758582Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:38.5758898Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:38.5759227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:38.5759565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:38.5759861Z ) 2025-05-07T20:31:38.5760212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:38.5760671Z def test_silu_mul_quant( 2025-05-07T20:31:38.5760927Z self, 2025-05-07T20:31:38.5761136Z T: int, 2025-05-07T20:31:38.5761341Z D: int, 2025-05-07T20:31:38.5761571Z scale_ub: Optional[float], 2025-05-07T20:31:38.5761847Z contiguous: bool, 2025-05-07T20:31:38.5762456Z compiled: bool, 2025-05-07T20:31:38.5762723Z ) -> None: 2025-05-07T20:31:38.5762972Z torch.manual_seed(2025) 2025-05-07T20:31:38.5763215Z 2025-05-07T20:31:38.5763495Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:38.5763842Z 2025-05-07T20:31:38.5764041Z x_sign = torch.sign(x) 2025-05-07T20:31:38.5764339Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:38.5764655Z x = x_sign * x_clamp 2025-05-07T20:31:38.5764896Z x0 = x[:, :D] 2025-05-07T20:31:38.5765120Z x1 = x[:, D:] 2025-05-07T20:31:38.5765337Z 2025-05-07T20:31:38.5765527Z if contiguous: 2025-05-07T20:31:38.5765927Z x0 = x0.contiguous() 2025-05-07T20:31:38.5766196Z x1 = x1.contiguous() 2025-05-07T20:31:38.5766432Z 2025-05-07T20:31:38.5766634Z if scale_ub is not None: 2025-05-07T20:31:38.5766915Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:38.5767260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:38.5767569Z ) 2025-05-07T20:31:38.5767774Z else: 2025-05-07T20:31:38.5767997Z scale_ub_tensor = None 2025-05-07T20:31:38.5768249Z 2025-05-07T20:31:38.5768484Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:38.5768804Z op = silu_mul_quant 2025-05-07T20:31:38.5769053Z if compiled: 2025-05-07T20:31:38.5769309Z op = torch.compile(op) 2025-05-07T20:31:38.5769615Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:38.5769891Z 2025-05-07T20:31:38.5770096Z y_fp8, y_scale = fn() 2025-05-07T20:31:38.5770388Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:38.5770682Z 2025-05-07T20:31:38.5770926Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:38.5771268Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:38.5771572Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:38.5771886Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:38.5772253Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:38.5772569Z 2025-05-07T20:31:38.5772774Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:38.5772979Z 2025-05-07T20:31:38.5773093Z moe/activation_test.py:126: 2025-05-07T20:31:38.5773435Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:38.5773770Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:38.5774105Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:38.5774893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:31:38.5775655Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:38.5776203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:38.5776887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:38.5777577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:38.5778302Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:38.5779048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:38.5779799Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:38.5780525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:38.5781161Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:38.5781868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:38.5782393Z fn() 2025-05-07T20:31:38.5782904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:38.5783485Z self.fn.run( 2025-05-07T20:31:38.5784008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:38.5784542Z kernel = self.compile( 2025-05-07T20:31:38.5785079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:38.5785734Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.5786136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:38.5786446Z 2025-05-07T20:31:38.5786657Z self = 2025-05-07T20:31:38.5787744Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:38.5789221Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a9986840>} 2025-05-07T20:31:38.5790563Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:38.5791584Z context = 2025-05-07T20:31:38.5791874Z 2025-05-07T20:31:38.5792049Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:38.5792562Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.5793035Z module_map=module_map) 2025-05-07T20:31:38.5793444Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.5793816Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:38.5794098Z E ^ 2025-05-07T20:31:38.5794569Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.5795017Z 
2025-05-07T20:31:38.5795444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.5795951Z 
2025-05-07T20:31:38.5796057Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:38.5796479Z self=,
2025-05-07T20:31:38.5796894Z T=128,
2025-05-07T20:31:38.5797084Z D=5120,
2025-05-07T20:31:38.5797290Z scale_ub=None,
2025-05-07T20:31:38.5797516Z contiguous=True,
2025-05-07T20:31:38.5797742Z compiled=True,
2025-05-07T20:31:38.5797968Z )
2025-05-07T20:31:39.2861630Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:39.2862922Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last):
2025-05-07T20:31:39.2864269Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:39.2865704Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:39.2866695Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:39.2868008Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:39.2869507Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:39.2870496Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:39.2871725Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:39.2873095Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:39.2874211Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:39.2875491Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:39.2876741Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     generator.visit(fn.parse())
2025-05-07T20:31:39.2877960Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:39.2879175Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ret = super().visit(node)
2025-05-07T20:31:39.2880123Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:39.2881158Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:39.2882180Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     return visitor(node)
2025-05-07T20:31:39.2882980Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ^^^^^^^^^^^^^
2025-05-07T20:31:39.2884241Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:39.2885596Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:39.2886715Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:39.2887759Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     self.visit(item)
2025-05-07T20:31:39.2888941Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:39.2890298Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:39.2891366Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:39.2892288Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:39.2893040Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^
2025-05-07T20:31:39.2894065Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:39.5197710Z self =
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
.../fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
.../triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
.../triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
.../triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
.../triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
.../triton/testing.py:117: in do_bench
    fn()
.../triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
.../triton/runtime/jit.py:623: in run
    kernel = self.compile(
.../triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
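Every failure in this run reduces to the same root cause: fp8e4nv is Triton's name for torch.float8_e4m3fn, and NVIDIA GPUs only expose that format natively from compute capability 8.9 (Ada/Hopper) onward. On an older part such as the sm_86 GPUs that g5 runners carry, Triton offers only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal sketch of the capability check implied by the error (the helper name is ours, not an FBGEMM or Triton API):

    import torch

    def fp8e4nv_supported(device: int = 0) -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability(device) >= (8, 9)

    # On an sm_86 GPU this returns False, matching the ValueError above:
    # only fp8e4b15 and fp8e5 are available there.
    print(fp8e4nv_supported())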
Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[... the identify_mutated_tensors warning and _fbgemm_silu_mul_quant traceback above repeat verbatim four more times here (W0507 20:31:39.867 through 20:31:40.231, [1/7]), each ending in the same fp8e4nv ValueError; duplicates omitted ...]
self =
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [test body identical to the T = 128 listing above; omitted]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
[... same triton_quantize_fp8_row -> autotuner -> compile -> make_ir frames as above ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
W0507 20:31:40.500000 237772 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:31:40.500000 237772 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:31:40.500000 237772 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:31:40.500000 237772 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:31:40.500000 237772 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.

self =
T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [test body identical to the T = 128 listing above; omitted]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
[... same triton_quantize_fp8_row -> autotuner -> compile -> make_ir frames as above ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

self =
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [test body identical to the T = 128 listing above; omitted]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... same jit -> compile -> make_ir frames as above; this example fails in fn() itself, i.e. in FBGEMM's own kernel rather than in the reference quantizer ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
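Both the FBGEMM kernel and the reference path compute the same thing: y = SiLU(x0) * x1, followed by row-wise fp8 quantization. For readers without an fp8-capable GPU, here is a plain-PyTorch sketch of the row-wise quantization that triton_quantize_fp8_row performs, assuming the usual max-abs-per-row scaling; FBGEMM's exact epsilon and clamping behavior may differ, and quantize_fp8_row_reference is our name, not an FBGEMM API:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_reference(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1).to(torch.float32).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # cap the row scale
        scale = row_max / fp8_max                        # one scale per row
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    x0, x1 = torch.randn(4, 8), torch.randn(4, 8)
    y = x0 * torch.sigmoid(x0) * x1                      # SiLU(x0) * x1, as in ref_fn
    y_fp8, y_scale = quantize_fp8_row_reference(y)
    y_round_trip = y_fp8.to(torch.float32) * y_scale[:, None]

The round-trip line mirrors how the test reconstructs y from (y_fp8, y_scale) before comparing against the reference.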
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

self =
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

    [test body identical to the T = 128 listing above; omitted]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
[... same triton_quantize_fp8_row -> autotuner -> compile -> make_ir frames as above ...]
at 0x7f6873915b20>} 2025-05-07T20:31:40.9217590Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:40.9218609Z context = 2025-05-07T20:31:40.9218979Z 2025-05-07T20:31:40.9219148Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:40.9219674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.9220135Z module_map=module_map) 2025-05-07T20:31:40.9220511Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:40.9220872Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:40.9221149Z E ^ 2025-05-07T20:31:40.9221616Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:40.9222072Z 2025-05-07T20:31:40.9222490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:40.9223000Z 2025-05-07T20:31:40.9223118Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.9223541Z self=, 2025-05-07T20:31:40.9223946Z T=1, 2025-05-07T20:31:40.9224140Z D=5120, 2025-05-07T20:31:40.9224331Z scale_ub=None, 2025-05-07T20:31:40.9224549Z contiguous=True, 2025-05-07T20:31:40.9224782Z compiled=False, 2025-05-07T20:31:40.9224993Z ) 2025-05-07T20:31:41.0392311Z self = 2025-05-07T20:31:41.0393068Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:41.0393429Z 2025-05-07T20:31:41.0393549Z @given( 2025-05-07T20:31:41.0393918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.0394249Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.0394557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.0394889Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.0395230Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.0395548Z ) 2025-05-07T20:31:41.0395898Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.0396343Z def test_silu_mul_quant( 2025-05-07T20:31:41.0396591Z self, 2025-05-07T20:31:41.0396807Z T: int, 2025-05-07T20:31:41.0397006Z D: int, 2025-05-07T20:31:41.0397232Z scale_ub: Optional[float], 2025-05-07T20:31:41.0397510Z contiguous: bool, 2025-05-07T20:31:41.0397750Z compiled: bool, 2025-05-07T20:31:41.0397979Z ) -> None: 2025-05-07T20:31:41.0398203Z torch.manual_seed(2025) 2025-05-07T20:31:41.0398443Z 2025-05-07T20:31:41.0398720Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.0399070Z 2025-05-07T20:31:41.0399270Z x_sign = torch.sign(x) 2025-05-07T20:31:41.0399565Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.0399910Z x = x_sign * x_clamp 2025-05-07T20:31:41.0400163Z x0 = x[:, :D] 2025-05-07T20:31:41.0400395Z x1 = x[:, D:] 2025-05-07T20:31:41.0400602Z 2025-05-07T20:31:41.0400800Z if contiguous: 2025-05-07T20:31:41.0401044Z x0 = x0.contiguous() 2025-05-07T20:31:41.0401300Z x1 = x1.contiguous() 2025-05-07T20:31:41.0401890Z 2025-05-07T20:31:41.0402099Z if scale_ub is not None: 2025-05-07T20:31:41.0402376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.0402717Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.0403033Z ) 2025-05-07T20:31:41.0403229Z else: 2025-05-07T20:31:41.0403453Z scale_ub_tensor = None 2025-05-07T20:31:41.0403737Z 2025-05-07T20:31:41.0403990Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.0404313Z op = silu_mul_quant 2025-05-07T20:31:41.0404574Z if compiled: 2025-05-07T20:31:41.0404829Z 
op = torch.compile(op) 2025-05-07T20:31:41.0405127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.0405584Z 2025-05-07T20:31:41.0405784Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.0405952Z 2025-05-07T20:31:41.0406054Z moe/activation_test.py:117: 2025-05-07T20:31:41.0406359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.0406695Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.0406974Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.0407669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.0408364Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.0408899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.0409574Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.0410239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.0410776Z kernel = self.compile( 2025-05-07T20:31:41.0411310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.0411967Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.0412364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.0412594Z 2025-05-07T20:31:41.0412810Z self = 2025-05-07T20:31:41.0413878Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.0415265Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6873372200>} 2025-05-07T20:31:41.0416612Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.0417630Z context = 2025-05-07T20:31:41.0417914Z 2025-05-07T20:31:41.0418087Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.0418603Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.0419074Z module_map=module_map) 2025-05-07T20:31:41.0419442Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.0419792Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.0420059Z E ^ 2025-05-07T20:31:41.0420527Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.0420977Z 2025-05-07T20:31:41.0421397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.0421905Z 2025-05-07T20:31:41.0422096Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.0422514Z self=, 2025-05-07T20:31:41.0422919Z T=128, 2025-05-07T20:31:41.0423108Z D=5120, 2025-05-07T20:31:41.0423310Z scale_ub=None, 2025-05-07T20:31:41.0423536Z contiguous=False, 2025-05-07T20:31:41.0423792Z compiled=True, 2025-05-07T20:31:41.0424027Z ) 2025-05-07T20:31:41.0424354Z self = 2025-05-07T20:31:41.0424847Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:41.0425111Z 2025-05-07T20:31:41.0425194Z @given( 2025-05-07T20:31:41.0425436Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.0425832Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.0426141Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.0426476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.0426817Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.0427100Z ) 2025-05-07T20:31:41.0427455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.0427907Z def test_silu_mul_quant( 2025-05-07T20:31:41.0428423Z self, 2025-05-07T20:31:41.0428627Z T: int, 2025-05-07T20:31:41.0428834Z D: int, 2025-05-07T20:31:41.0429064Z scale_ub: Optional[float], 2025-05-07T20:31:41.0429392Z contiguous: bool, 2025-05-07T20:31:41.0429646Z compiled: bool, 2025-05-07T20:31:41.0429877Z ) -> None: 2025-05-07T20:31:41.0430097Z torch.manual_seed(2025) 2025-05-07T20:31:41.0430342Z 2025-05-07T20:31:41.0430617Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.0430962Z 2025-05-07T20:31:41.0431164Z x_sign = torch.sign(x) 2025-05-07T20:31:41.0431458Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.0431769Z x = x_sign * x_clamp 2025-05-07T20:31:41.0432013Z x0 = x[:, :D] 2025-05-07T20:31:41.0432237Z x1 = x[:, D:] 2025-05-07T20:31:41.0432446Z 2025-05-07T20:31:41.0432640Z if contiguous: 2025-05-07T20:31:41.0432875Z x0 = x0.contiguous() 2025-05-07T20:31:41.0433140Z x1 = x1.contiguous() 2025-05-07T20:31:41.0433405Z 2025-05-07T20:31:41.0433712Z if scale_ub is not None: 2025-05-07T20:31:41.0434081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.0434413Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.0434724Z ) 2025-05-07T20:31:41.0434921Z else: 2025-05-07T20:31:41.0435130Z scale_ub_tensor = None 2025-05-07T20:31:41.0435394Z 2025-05-07T20:31:41.0435635Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.0435949Z op = silu_mul_quant 2025-05-07T20:31:41.0436211Z if compiled: 2025-05-07T20:31:41.0436470Z op = torch.compile(op) 2025-05-07T20:31:41.0436762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.0437044Z 2025-05-07T20:31:41.0437246Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.0437410Z 2025-05-07T20:31:41.0437512Z moe/activation_test.py:117: 2025-05-07T20:31:41.0437814Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.0438159Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.0438448Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.0439003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.0439570Z return fn(*args, **kwargs) 
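Every Hypothesis example in this run dies the same way, in the frames that follow: fbgemm_gpu's silu_mul_quant launches a Triton kernel whose output dtype is fp8e4nv (Triton's name for the float8_e4m3fn encoding), and Triton only emits that encoding on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). On older parts, for example an A10G at SM 8.6 or an A100 at SM 8.0, it offers only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal preflight sketch; the helper name is ours, not FBGEMM's:

```python
import torch

def supports_fp8e4nv() -> bool:
    """True if Triton can compile fp8e4nv (float8_e4m3fn) kernels here.

    NVIDIA support for this encoding starts at compute capability (8, 9);
    pre-Ada GPUs get only fp8e4b15/fp8e5, matching the ValueError above.
    """
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)
```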
2025-05-07T20:31:41.0440240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.0440921Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.0441615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.0442306Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.0442975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.0443503Z kernel = self.compile( 2025-05-07T20:31:41.0444053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.0444722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.0445123Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.0445466Z 2025-05-07T20:31:41.0445676Z self = 2025-05-07T20:31:41.0446762Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.0448135Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68738ee0c0>} 2025-05-07T20:31:41.0449473Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.0450489Z context = 2025-05-07T20:31:41.0450785Z 2025-05-07T20:31:41.0450957Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.0451487Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.0451956Z module_map=module_map) 2025-05-07T20:31:41.0452327Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.0452686Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.0452949Z E ^ 2025-05-07T20:31:41.0453411Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False): same CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): identical source listing and traceback elided, ending in:
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.4689510Z 2025-05-07T20:31:41.4689928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.4690447Z 2025-05-07T20:31:41.4690554Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.4690972Z self=, 2025-05-07T20:31:41.4691374Z T=1, 2025-05-07T20:31:41.4691574Z D=7168, 2025-05-07T20:31:41.4691775Z scale_ub=1200.0, 2025-05-07T20:31:41.4692001Z contiguous=True, 2025-05-07T20:31:41.4692231Z compiled=True, 2025-05-07T20:31:41.4692448Z ) 2025-05-07T20:31:41.4692774Z self = 2025-05-07T20:31:41.4693261Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.4693521Z 2025-05-07T20:31:41.4693608Z @given( 2025-05-07T20:31:41.4693840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.4694158Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.4694469Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.4694805Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.4695130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.4695418Z ) 2025-05-07T20:31:41.4695769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.4696251Z def test_silu_mul_quant( 2025-05-07T20:31:41.4696601Z self, 2025-05-07T20:31:41.4696833Z T: int, 2025-05-07T20:31:41.4697031Z D: int, 2025-05-07T20:31:41.4697259Z scale_ub: Optional[float], 2025-05-07T20:31:41.4697542Z contiguous: bool, 2025-05-07T20:31:41.4697778Z compiled: bool, 2025-05-07T20:31:41.4698004Z ) -> None: 2025-05-07T20:31:41.4698227Z torch.manual_seed(2025) 2025-05-07T20:31:41.4698467Z 2025-05-07T20:31:41.4698747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.4699092Z 2025-05-07T20:31:41.4699294Z x_sign = torch.sign(x) 2025-05-07T20:31:41.4699584Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.4699898Z x = x_sign * x_clamp 2025-05-07T20:31:41.4700142Z x0 = x[:, :D] 2025-05-07T20:31:41.4700361Z x1 = x[:, D:] 2025-05-07T20:31:41.4700574Z 2025-05-07T20:31:41.4700768Z if contiguous: 2025-05-07T20:31:41.4700997Z x0 = x0.contiguous() 2025-05-07T20:31:41.4701265Z x1 = x1.contiguous() 2025-05-07T20:31:41.4701507Z 2025-05-07T20:31:41.4701697Z if scale_ub is not None: 2025-05-07T20:31:41.4702064Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.4702403Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.4702707Z ) 2025-05-07T20:31:41.4702908Z else: 2025-05-07T20:31:41.4703126Z scale_ub_tensor = None 2025-05-07T20:31:41.4703373Z 2025-05-07T20:31:41.4703618Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.4703938Z op = silu_mul_quant 2025-05-07T20:31:41.4704199Z if compiled: 2025-05-07T20:31:41.4704446Z op = torch.compile(op) 2025-05-07T20:31:41.4704751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.4705030Z 2025-05-07T20:31:41.4705228Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.4705509Z 2025-05-07T20:31:41.4705609Z moe/activation_test.py:117: 2025-05-07T20:31:41.4705910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.4706239Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.4706532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.4707095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.4707657Z return fn(*args, **kwargs) 
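Because the failure is a compile-time property of the GPU rather than of any particular (T, D, scale_ub, contiguous, compiled) combination, the remaining tracebacks add no information; a capability-based skip would collapse the whole run into one skipped test. A sketch of such a guard for a module like moe/activation_test.py; the class name is illustrative, since the test class repr is stripped from this log:

```python
import unittest
import torch

_HAS_FP8E4NV = (
    torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)
)

@unittest.skipIf(not _HAS_FP8E4NV, "Triton fp8e4nv needs SM 8.9+ (Ada/Hopper)")
class SiluMulQuantTests(unittest.TestCase):  # illustrative name
    # the @given/@settings-decorated test_silu_mul_quant body from the
    # listing above would live here unchanged
    ...
```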
2025-05-07T20:31:41.4708310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.4709003Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.4709629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.4710308Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.4710980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.4711518Z kernel = self.compile( 2025-05-07T20:31:41.4712068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.4712718Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.4713129Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.4713356Z 2025-05-07T20:31:41.4713570Z self = 2025-05-07T20:31:41.4714651Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.4716006Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6872f8e840>} 2025-05-07T20:31:41.4717358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.4718379Z context = 2025-05-07T20:31:41.4718666Z 2025-05-07T20:31:41.4718849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.4719365Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.4719836Z module_map=module_map) 2025-05-07T20:31:41.4720207Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.4720572Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.4720833Z E ^ 2025-05-07T20:31:41.4721315Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.4721761Z 2025-05-07T20:31:41.4722271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.4722785Z 2025-05-07T20:31:41.4722891Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.4723309Z self=, 2025-05-07T20:31:41.4723733Z T=1, 2025-05-07T20:31:41.4723951Z D=7168, 2025-05-07T20:31:41.4724145Z scale_ub=1200.0, 2025-05-07T20:31:41.4724376Z contiguous=False, 2025-05-07T20:31:41.4724607Z compiled=True, 2025-05-07T20:31:41.4724812Z ) 2025-05-07T20:31:41.5726075Z self = 2025-05-07T20:31:41.5726774Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:41.5727043Z 2025-05-07T20:31:41.5727540Z @given( 2025-05-07T20:31:41.5727782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.5728101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.5728779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.5729129Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.5729463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.5729749Z ) 2025-05-07T20:31:41.5730096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.5730541Z def test_silu_mul_quant( 2025-05-07T20:31:41.5730796Z self, 2025-05-07T20:31:41.5730990Z T: int, 2025-05-07T20:31:41.5731194Z D: int, 2025-05-07T20:31:41.5731421Z scale_ub: Optional[float], 2025-05-07T20:31:41.5731688Z contiguous: bool, 2025-05-07T20:31:41.5731936Z compiled: bool, 2025-05-07T20:31:41.5732170Z ) -> None: 2025-05-07T20:31:41.5732391Z torch.manual_seed(2025) 2025-05-07T20:31:41.5732644Z 2025-05-07T20:31:41.5732941Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.5733290Z 2025-05-07T20:31:41.5733481Z x_sign = torch.sign(x) 2025-05-07T20:31:41.5733775Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.5734089Z x = x_sign * x_clamp 2025-05-07T20:31:41.5734348Z x0 = x[:, :D] 2025-05-07T20:31:41.5734592Z x1 = x[:, D:] 2025-05-07T20:31:41.5734804Z 2025-05-07T20:31:41.5734995Z if contiguous: 2025-05-07T20:31:41.5735223Z x0 = x0.contiguous() 2025-05-07T20:31:41.5735485Z x1 = x1.contiguous() 2025-05-07T20:31:41.5735724Z 2025-05-07T20:31:41.5735915Z if scale_ub is not None: 2025-05-07T20:31:41.5736189Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.5736529Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.5736833Z ) 2025-05-07T20:31:41.5737031Z else: 2025-05-07T20:31:41.5737247Z scale_ub_tensor = None 2025-05-07T20:31:41.5737496Z 2025-05-07T20:31:41.5737737Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.5738058Z op = silu_mul_quant 2025-05-07T20:31:41.5738314Z if compiled: 2025-05-07T20:31:41.5738566Z op = torch.compile(op) 2025-05-07T20:31:41.5738866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.5739135Z 2025-05-07T20:31:41.5739331Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.5739503Z 2025-05-07T20:31:41.5739606Z moe/activation_test.py:117: 2025-05-07T20:31:41.5739901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.5740228Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.5740511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.5741069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.5741629Z return fn(*args, **kwargs) 
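Note that the eager path (moe/activation_test.py:115 in fn) and the compiled path (the eval_frame.py frame above) both funnel into the same Triton JIT compile, so neither Hypothesis nor torch.compile is needed to reproduce this. A repro sketch, assuming fbgemm_gpu is importable and the GPU predates SM 8.9; passing None for scale_ub mirrors the listing above:

```python
import torch
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
    triton_quantize_fp8_row,
)

y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
# On a GPU without fp8e4nv support this raises the same
# triton.compiler.errors.CompilationError at kernel-compile time;
# on SM 8.9+ it returns a (fp8 tensor, per-row scale) pair.
y_fp8, y_scale = triton_quantize_fp8_row(y, None)
```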
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[compile frames identical to the examples above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True): here fn() returns and the reference path fails instead, in triton_quantize_fp8_row -> _kernel_quantize_fp8_row (fp8_gemm.py:2370), with the same CompilationError
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): identical source listing and traceback elided, ending in:
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:41.7720475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then retries the same test body with fresh parameters; every one of the following examples fails during Triton compilation before the kernel ever runs:

2025-05-07T20:31:41.7721099Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:42.0394733Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:42.1323560Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:42.1356545Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:42.1387916Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:42.2794891Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:42.3981646Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:42.4023587Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:42.4929697Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:42.4962476Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:42.4997748Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Each of these examples produces the identical traceback: silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, reached via torch/_dynamo/eval_frame.py:678 when compiled=True) launches _fbgemm_silu_mul_quant[grid], Triton re-runs JIT compilation (triton/runtime/jit.py:330 -> jit.py:623 -> triton/compiler/compiler.py:273 -> make_ir), and ast_to_ttir aborts with:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
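The failure is environmental rather than numerical: fp8e4nv is Triton's name for the FP8 E4M3 format, which NVIDIA GPUs only support natively from compute capability 8.9 (Ada) / 9.0 (Hopper) onward. This job runs on a linux.g5.4xlarge.nvidia.gpu runner, i.e. an A10G at SM 8.6, so Triton's CUDA backend offers only fp8e4b15 and fp8e5 there and rejects the kernel at compile time, independent of T, D, scale_ub, contiguous, or compiled. A minimal sketch of the kind of capability guard such a test could use follows; the helper name and the skipUnless usage are illustrative assumptions, not FBGEMM's actual API:

    import unittest

    import torch


    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (FP8 E4M3) needs native hardware support, which NVIDIA
        # exposes starting at SM 8.9; the A10G behind linux.g5.4xlarge is
        # SM 8.6, which is why Triton's make_ir raises here.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)


    # Hypothetical usage: skip on pre-SM89 runners instead of letting
    # every Hypothesis example die inside Triton compilation.
    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantTests(unittest.TestCase):
        ...

FBGEMM's CI may gate this differently (for example with its own device-capability checks); the sketch only illustrates why the error is architecture-dependent rather than a bug in the kernel's inputs.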
2025-05-07T20:31:42.8868707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.8869520Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.8870255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.8870940Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.8871599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.8872130Z kernel = self.compile( 2025-05-07T20:31:42.8872669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.8880799Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.8881234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.8881475Z 2025-05-07T20:31:42.8881683Z self = 2025-05-07T20:31:42.8882913Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.8884319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68723fd1c0>} 2025-05-07T20:31:42.8885671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.8886699Z context = 2025-05-07T20:31:42.8886993Z 2025-05-07T20:31:42.8887161Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.8887697Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.8888158Z module_map=module_map) 2025-05-07T20:31:42.8888534Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.8888901Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.8889162Z E ^ 2025-05-07T20:31:42.8889635Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.8890094Z 2025-05-07T20:31:42.8890517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.8891031Z 2025-05-07T20:31:42.8891143Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.8891555Z self=, 2025-05-07T20:31:42.8891965Z T=1, 2025-05-07T20:31:42.8892159Z D=7168, 2025-05-07T20:31:42.8892359Z scale_ub=None, 2025-05-07T20:31:42.8892588Z contiguous=False, 2025-05-07T20:31:42.8892819Z compiled=False, 2025-05-07T20:31:42.8893025Z ) 2025-05-07T20:31:42.8893350Z self = 2025-05-07T20:31:42.8893851Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:42.8894114Z 2025-05-07T20:31:42.8894202Z @given( 2025-05-07T20:31:42.8894432Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.8894753Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.8895065Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.8895395Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.8895731Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.8896023Z ) 2025-05-07T20:31:42.8896369Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.8896815Z def test_silu_mul_quant( 2025-05-07T20:31:42.8897069Z self, 2025-05-07T20:31:42.8897270Z T: int, 2025-05-07T20:31:42.8897466Z D: int, 2025-05-07T20:31:42.8897691Z scale_ub: Optional[float], 2025-05-07T20:31:42.8897969Z contiguous: bool, 2025-05-07T20:31:42.8898296Z compiled: bool, 2025-05-07T20:31:42.8898523Z ) -> None: 2025-05-07T20:31:42.8898745Z torch.manual_seed(2025) 2025-05-07T20:31:42.8898988Z 2025-05-07T20:31:42.8899265Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.8899611Z 2025-05-07T20:31:42.8899803Z x_sign = torch.sign(x) 2025-05-07T20:31:42.8900098Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.8900410Z x = x_sign * x_clamp 2025-05-07T20:31:42.8900645Z x0 = x[:, :D] 2025-05-07T20:31:42.8900871Z x1 = x[:, D:] 2025-05-07T20:31:42.8901085Z 2025-05-07T20:31:42.8901269Z if contiguous: 2025-05-07T20:31:42.8901506Z x0 = x0.contiguous() 2025-05-07T20:31:42.8901852Z x1 = x1.contiguous() 2025-05-07T20:31:42.8902089Z 2025-05-07T20:31:42.8902289Z if scale_ub is not None: 2025-05-07T20:31:42.8902571Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.8902923Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.8903228Z ) 2025-05-07T20:31:42.8903427Z else: 2025-05-07T20:31:42.8903644Z scale_ub_tensor = None 2025-05-07T20:31:42.8903893Z 2025-05-07T20:31:42.8904138Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.8904465Z op = silu_mul_quant 2025-05-07T20:31:42.8904719Z if compiled: 2025-05-07T20:31:42.8904977Z op = torch.compile(op) 2025-05-07T20:31:42.8905290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.8905561Z 2025-05-07T20:31:42.8905765Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.8905929Z 2025-05-07T20:31:42.8906040Z moe/activation_test.py:117: 2025-05-07T20:31:42.8906349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.8906690Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.8906980Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.8907682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.8908369Z 
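Every example Hypothesis tries above dies in the same place: Triton's frontend rejects the fp8e4nv (FP8 E4M3) element type before the _fbgemm_silu_mul_quant kernel body is even lowered, so the test parameters never matter. A minimal sketch for confirming the mismatch on the runner, assuming the job ran on a GPU below compute capability 8.9 (e.g. an A10G at 8.6); the 8.9 threshold is our reading of Triton's supported-dtype list, not something the log states explicitly:

    # Hypothetical capability probe; fp8e4nv (torch.float8_e4m3fn) is assumed
    # to require compute capability >= (8, 9), per Triton's error message above.
    import torch

    cap = torch.cuda.get_device_capability()
    print(f"compute capability: {cap[0]}.{cap[1]}")
    if cap < (8, 9):
        # On this hardware Triton only offers fp8e4b15 and fp8e5, so any kernel
        # touching fp8e4nv raises the CompilationError seen throughout this log.
        print("fp8e4nv unsupported -> expect CompilationError from Triton")

Every further example Hypothesis draws hits the same frontend check, with an identical traceback: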
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.8908920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.8909672Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.8910349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.8910879Z kernel = self.compile( 2025-05-07T20:31:42.8911429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.8912102Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.8912507Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.8912751Z 2025-05-07T20:31:42.8912958Z self = 2025-05-07T20:31:42.8914046Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.8915425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68723fdf80>} 2025-05-07T20:31:42.8916780Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.8917819Z context = 2025-05-07T20:31:42.8918113Z 2025-05-07T20:31:42.8918420Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.8918946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.8919416Z module_map=module_map) 2025-05-07T20:31:42.8919776Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.8920132Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.8920394Z E ^ 2025-05-07T20:31:42.8920853Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.8921314Z 2025-05-07T20:31:42.8921732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.8922333Z 2025-05-07T20:31:42.8922443Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.8922862Z self=, 2025-05-07T20:31:42.8923262Z T=2048, 2025-05-07T20:31:42.8923466Z D=7168, 2025-05-07T20:31:42.8923663Z scale_ub=None, 2025-05-07T20:31:42.8923878Z contiguous=False, 2025-05-07T20:31:42.8924107Z compiled=True, 2025-05-07T20:31:42.8924315Z ) 2025-05-07T20:31:42.9597929Z self = 2025-05-07T20:31:42.9598731Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:42.9599107Z 2025-05-07T20:31:42.9599219Z @given( 2025-05-07T20:31:42.9599540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.9599862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.9600168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.9600542Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.9600878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.9601180Z ) 2025-05-07T20:31:42.9601536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.9601983Z def test_silu_mul_quant( 2025-05-07T20:31:42.9602235Z self, 2025-05-07T20:31:42.9602430Z T: int, 2025-05-07T20:31:42.9602636Z D: int, 2025-05-07T20:31:42.9602860Z scale_ub: Optional[float], 2025-05-07T20:31:42.9603135Z contiguous: bool, 2025-05-07T20:31:42.9603386Z compiled: bool, 2025-05-07T20:31:42.9603611Z ) -> None: 2025-05-07T20:31:42.9603831Z torch.manual_seed(2025) 2025-05-07T20:31:42.9604078Z 2025-05-07T20:31:42.9604389Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.9604747Z 2025-05-07T20:31:42.9604945Z x_sign = torch.sign(x) 2025-05-07T20:31:42.9605241Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.9605552Z x = x_sign * x_clamp 2025-05-07T20:31:42.9605798Z x0 = x[:, :D] 2025-05-07T20:31:42.9606026Z x1 = x[:, D:] 2025-05-07T20:31:42.9606230Z 2025-05-07T20:31:42.9606428Z if contiguous: 2025-05-07T20:31:42.9606667Z x0 = x0.contiguous() 2025-05-07T20:31:42.9606920Z x1 = x1.contiguous() 2025-05-07T20:31:42.9607171Z 2025-05-07T20:31:42.9607369Z if scale_ub is not None: 2025-05-07T20:31:42.9607636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.9607985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.9608300Z ) 2025-05-07T20:31:42.9608493Z else: 2025-05-07T20:31:42.9608708Z scale_ub_tensor = None 2025-05-07T20:31:42.9608964Z 2025-05-07T20:31:42.9609196Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.9609513Z op = silu_mul_quant 2025-05-07T20:31:42.9609775Z if compiled: 2025-05-07T20:31:42.9610023Z op = torch.compile(op) 2025-05-07T20:31:42.9610321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.9610607Z 2025-05-07T20:31:42.9610804Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.9611320Z 2025-05-07T20:31:42.9611425Z moe/activation_test.py:117: 2025-05-07T20:31:42.9611724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.9612062Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.9612342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.9612896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.9613462Z return fn(*args, **kwargs) 
2025-05-07T20:31:42.9614122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.9614799Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.9615478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.9616158Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.9616816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.9617346Z kernel = self.compile( 2025-05-07T20:31:42.9617885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.9618541Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.9618935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.9619173Z 2025-05-07T20:31:42.9619379Z self = 2025-05-07T20:31:42.9620456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.9621839Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68723ff420>} 2025-05-07T20:31:42.9623171Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.9624190Z context = 2025-05-07T20:31:42.9624479Z 2025-05-07T20:31:42.9624644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.9625211Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.9625685Z module_map=module_map) 2025-05-07T20:31:42.9626053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.9626411Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.9626675Z E ^ 2025-05-07T20:31:42.9627141Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.9627593Z 2025-05-07T20:31:42.9628009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.9628923Z 2025-05-07T20:31:42.9629039Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.9629501Z self=, 2025-05-07T20:31:42.9629906Z T=4096, 2025-05-07T20:31:42.9630100Z D=7168, 2025-05-07T20:31:42.9630301Z scale_ub=None, 2025-05-07T20:31:42.9630517Z contiguous=False, 2025-05-07T20:31:42.9630749Z compiled=True, 2025-05-07T20:31:42.9630969Z ) 2025-05-07T20:31:42.9631285Z self = 2025-05-07T20:31:42.9631781Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:42.9632051Z 2025-05-07T20:31:42.9632314Z @given( 2025-05-07T20:31:42.9632563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.9632912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.9633253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.9633620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.9633993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.9634312Z ) 2025-05-07T20:31:42.9634758Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.9635267Z def test_silu_mul_quant( 2025-05-07T20:31:42.9635531Z self, 2025-05-07T20:31:42.9635737Z T: int, 2025-05-07T20:31:42.9635939Z D: int, 2025-05-07T20:31:42.9636287Z scale_ub: Optional[float], 2025-05-07T20:31:42.9636588Z contiguous: bool, 2025-05-07T20:31:42.9636841Z compiled: bool, 2025-05-07T20:31:42.9637082Z ) -> None: 2025-05-07T20:31:42.9637317Z torch.manual_seed(2025) 2025-05-07T20:31:42.9637583Z 2025-05-07T20:31:42.9637884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.9638278Z 2025-05-07T20:31:42.9638475Z x_sign = torch.sign(x) 2025-05-07T20:31:42.9638795Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.9639142Z x = x_sign * x_clamp 2025-05-07T20:31:42.9639397Z x0 = x[:, :D] 2025-05-07T20:31:42.9639630Z x1 = x[:, D:] 2025-05-07T20:31:42.9639861Z 2025-05-07T20:31:42.9640058Z if contiguous: 2025-05-07T20:31:42.9640306Z x0 = x0.contiguous() 2025-05-07T20:31:42.9640594Z x1 = x1.contiguous() 2025-05-07T20:31:42.9640861Z 2025-05-07T20:31:42.9641069Z if scale_ub is not None: 2025-05-07T20:31:42.9641375Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.9641758Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.9642109Z ) 2025-05-07T20:31:42.9642325Z else: 2025-05-07T20:31:42.9642557Z scale_ub_tensor = None 2025-05-07T20:31:42.9642826Z 2025-05-07T20:31:42.9643086Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.9643445Z op = silu_mul_quant 2025-05-07T20:31:42.9643718Z if compiled: 2025-05-07T20:31:42.9643989Z op = torch.compile(op) 2025-05-07T20:31:42.9644321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.9644625Z 2025-05-07T20:31:42.9644839Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.9645032Z 2025-05-07T20:31:42.9645140Z moe/activation_test.py:117: 2025-05-07T20:31:42.9645480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.9645861Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.9646179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.9646843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.9647509Z return fn(*args, **kwargs) 
2025-05-07T20:31:42.9648301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.9649130Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.9649762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.9650577Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.9651376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.9652004Z kernel = self.compile( 2025-05-07T20:31:42.9652647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.9653436Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.9653983Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.9654254Z 2025-05-07T20:31:42.9654499Z self = 2025-05-07T20:31:42.9655868Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.9657583Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727e8680>} 2025-05-07T20:31:42.9659249Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.9660356Z context = 2025-05-07T20:31:42.9660648Z 2025-05-07T20:31:42.9660822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.9661352Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.9661820Z module_map=module_map) 2025-05-07T20:31:42.9662192Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.9662544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.9662801Z E ^ 2025-05-07T20:31:42.9663271Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.9663716Z 2025-05-07T20:31:42.9664140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.9664653Z 2025-05-07T20:31:43.0929935Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.0930643Z self=, 2025-05-07T20:31:43.0931243Z T=16384, 2025-05-07T20:31:43.0931516Z D=5120, 2025-05-07T20:31:43.0931788Z scale_ub=1200.0, 2025-05-07T20:31:43.0932053Z contiguous=False, 2025-05-07T20:31:43.0932285Z compiled=False, 2025-05-07T20:31:43.0932503Z ) 2025-05-07T20:31:43.0932827Z self = 2025-05-07T20:31:43.0933333Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:43.0933618Z 2025-05-07T20:31:43.0933707Z @given( 2025-05-07T20:31:43.0933939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.0934259Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.0934579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.0934918Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.0935292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.0935582Z ) 2025-05-07T20:31:43.0935941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.0936378Z def test_silu_mul_quant( 2025-05-07T20:31:43.0936630Z self, 2025-05-07T20:31:43.0936833Z T: int, 2025-05-07T20:31:43.0937033Z D: int, 2025-05-07T20:31:43.0937262Z scale_ub: Optional[float], 2025-05-07T20:31:43.0937539Z contiguous: bool, 2025-05-07T20:31:43.0937778Z compiled: bool, 2025-05-07T20:31:43.0938008Z ) -> None: 2025-05-07T20:31:43.0938232Z torch.manual_seed(2025) 2025-05-07T20:31:43.0938472Z 2025-05-07T20:31:43.0938752Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.0939100Z 2025-05-07T20:31:43.0939297Z x_sign = torch.sign(x) 2025-05-07T20:31:43.0939589Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.0939898Z x = x_sign * x_clamp 2025-05-07T20:31:43.0940139Z x0 = x[:, :D] 2025-05-07T20:31:43.0940352Z x1 = x[:, D:] 2025-05-07T20:31:43.0940909Z 2025-05-07T20:31:43.0941102Z if contiguous: 2025-05-07T20:31:43.0941332Z x0 = x0.contiguous() 2025-05-07T20:31:43.0941594Z x1 = x1.contiguous() 2025-05-07T20:31:43.0941834Z 2025-05-07T20:31:43.0942024Z if scale_ub is not None: 2025-05-07T20:31:43.0942293Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.0942639Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.0942950Z ) 2025-05-07T20:31:43.0943144Z else: 2025-05-07T20:31:43.0943360Z scale_ub_tensor = None 2025-05-07T20:31:43.0943615Z 2025-05-07T20:31:43.0943842Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.0944317Z op = silu_mul_quant 2025-05-07T20:31:43.0944573Z if compiled: 2025-05-07T20:31:43.0944831Z op = torch.compile(op) 2025-05-07T20:31:43.0945156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0945470Z 2025-05-07T20:31:43.0945668Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.0945833Z 2025-05-07T20:31:43.0945933Z moe/activation_test.py:117: 2025-05-07T20:31:43.0946232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0946573Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.0946850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0947540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:43.0948230Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.0948768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.0949530Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.0950195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.0950733Z kernel = self.compile( 2025-05-07T20:31:43.0951267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.0951920Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.0952326Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0952553Z 2025-05-07T20:31:43.0952764Z self = 2025-05-07T20:31:43.0953833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.0955266Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727e94e0>} 2025-05-07T20:31:43.0956607Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.0957631Z context = 2025-05-07T20:31:43.0957916Z 2025-05-07T20:31:43.0958087Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.0958607Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.0959072Z module_map=module_map) 2025-05-07T20:31:43.0959440Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.0959792Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.0960056Z E ^ 2025-05-07T20:31:43.0960522Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.0961055Z 2025-05-07T20:31:43.0961480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.0961990Z 2025-05-07T20:31:43.0962096Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.0962537Z self=, 2025-05-07T20:31:43.0962942Z T=16384, 2025-05-07T20:31:43.0963138Z D=5120, 2025-05-07T20:31:43.0963331Z scale_ub=1200.0, 2025-05-07T20:31:43.0963558Z contiguous=True, 2025-05-07T20:31:43.0963783Z compiled=True, 2025-05-07T20:31:43.0964007Z ) 2025-05-07T20:31:43.0964407Z self = 2025-05-07T20:31:43.0965154Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:43.0965495Z 2025-05-07T20:31:43.0965605Z @given( 2025-05-07T20:31:43.0965891Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.0966293Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.0966673Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.0967000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.0967334Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.0967628Z ) 2025-05-07T20:31:43.0967973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.0968415Z def test_silu_mul_quant( 2025-05-07T20:31:43.0968661Z self, 2025-05-07T20:31:43.0968856Z T: int, 2025-05-07T20:31:43.0969057Z D: int, 2025-05-07T20:31:43.0969277Z scale_ub: Optional[float], 2025-05-07T20:31:43.0969547Z contiguous: bool, 2025-05-07T20:31:43.0969787Z compiled: bool, 2025-05-07T20:31:43.0970010Z ) -> None: 2025-05-07T20:31:43.0970233Z torch.manual_seed(2025) 2025-05-07T20:31:43.0970471Z 2025-05-07T20:31:43.0970748Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.0971101Z 2025-05-07T20:31:43.0971306Z x_sign = torch.sign(x) 2025-05-07T20:31:43.0971596Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.0971906Z x = x_sign * x_clamp 2025-05-07T20:31:43.0972150Z x0 = x[:, :D] 2025-05-07T20:31:43.0972363Z x1 = x[:, D:] 2025-05-07T20:31:43.0981038Z 2025-05-07T20:31:43.0981264Z if contiguous: 2025-05-07T20:31:43.0981516Z x0 = x0.contiguous() 2025-05-07T20:31:43.0981780Z x1 = x1.contiguous() 2025-05-07T20:31:43.0982019Z 2025-05-07T20:31:43.0982221Z if scale_ub is not None: 2025-05-07T20:31:43.0982501Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.0982848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.0983156Z ) 2025-05-07T20:31:43.0983357Z else: 2025-05-07T20:31:43.0983570Z scale_ub_tensor = None 2025-05-07T20:31:43.0983828Z 2025-05-07T20:31:43.0984077Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.0984387Z op = silu_mul_quant 2025-05-07T20:31:43.0984651Z if compiled: 2025-05-07T20:31:43.0984904Z op = torch.compile(op) 2025-05-07T20:31:43.0985206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0985475Z 2025-05-07T20:31:43.0985675Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.0985846Z 2025-05-07T20:31:43.0985957Z moe/activation_test.py:117: 2025-05-07T20:31:43.0986248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0986584Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.0986874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0987430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.0987994Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.0988766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.0989560Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.0990091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.0990770Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.0991432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.0991957Z kernel = self.compile( 2025-05-07T20:31:43.0992502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.0993242Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.0993646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0993870Z 2025-05-07T20:31:43.0994082Z self = 2025-05-07T20:31:43.0995158Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.0996525Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727ea8e0>} 2025-05-07T20:31:43.0997861Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.0998885Z context = 2025-05-07T20:31:43.0999172Z 2025-05-07T20:31:43.0999338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.0999861Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.1000333Z module_map=module_map) 2025-05-07T20:31:43.1000691Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.1001045Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.1001305Z E ^ 2025-05-07T20:31:43.1001768Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.1002214Z 2025-05-07T20:31:43.1002633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.1003161Z 2025-05-07T20:31:43.4289103Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.4289755Z self=, 2025-05-07T20:31:43.4290284Z T=16384, 2025-05-07T20:31:43.4290498Z D=5120, 2025-05-07T20:31:43.4290726Z scale_ub=None, 2025-05-07T20:31:43.4290956Z contiguous=False, 2025-05-07T20:31:43.4291192Z compiled=True, 2025-05-07T20:31:43.4291401Z ) 2025-05-07T20:31:43.4291734Z self = 2025-05-07T20:31:43.4292239Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.4292515Z 2025-05-07T20:31:43.4292599Z @given( 2025-05-07T20:31:43.4292845Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.4293165Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.4293480Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.4293811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.4294165Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.4294464Z ) 2025-05-07T20:31:43.4294814Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.4295633Z def test_silu_mul_quant( 2025-05-07T20:31:43.4295884Z self, 2025-05-07T20:31:43.4296082Z T: int, 2025-05-07T20:31:43.4296287Z D: int, 2025-05-07T20:31:43.4296516Z scale_ub: Optional[float], 2025-05-07T20:31:43.4296786Z contiguous: bool, 2025-05-07T20:31:43.4297035Z compiled: bool, 2025-05-07T20:31:43.4297277Z ) -> None: 2025-05-07T20:31:43.4297494Z torch.manual_seed(2025) 2025-05-07T20:31:43.4297744Z 2025-05-07T20:31:43.4298028Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.4298373Z 2025-05-07T20:31:43.4298577Z x_sign = torch.sign(x) 2025-05-07T20:31:43.4298870Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.4299342Z x = x_sign * x_clamp 2025-05-07T20:31:43.4299584Z x0 = x[:, :D] 2025-05-07T20:31:43.4299805Z x1 = x[:, D:] 2025-05-07T20:31:43.4300019Z 2025-05-07T20:31:43.4300205Z if contiguous: 2025-05-07T20:31:43.4300448Z x0 = x0.contiguous() 2025-05-07T20:31:43.4300713Z x1 = x1.contiguous() 2025-05-07T20:31:43.4300949Z 2025-05-07T20:31:43.4301145Z if scale_ub is not None: 2025-05-07T20:31:43.4301422Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.4301755Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.4302068Z ) 2025-05-07T20:31:43.4302273Z else: 2025-05-07T20:31:43.4302484Z scale_ub_tensor = None 2025-05-07T20:31:43.4302748Z 2025-05-07T20:31:43.4302993Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.4303304Z op = silu_mul_quant 2025-05-07T20:31:43.4303563Z if compiled: 2025-05-07T20:31:43.4303824Z op = torch.compile(op) 2025-05-07T20:31:43.4304128Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.4304403Z 2025-05-07T20:31:43.4304633Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.4304822Z 2025-05-07T20:31:43.4304937Z moe/activation_test.py:117: 2025-05-07T20:31:43.4305236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.4305577Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.4305864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.4306420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.4306993Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.4307653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.4308347Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.4308876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.4309666Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.4310330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.4310863Z kernel = self.compile( 2025-05-07T20:31:43.4311398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.4312055Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.4312465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.4312694Z 2025-05-07T20:31:43.4312912Z self = 2025-05-07T20:31:43.4313984Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.4315513Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727eaf20>} 2025-05-07T20:31:43.4316855Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.4317878Z context = 2025-05-07T20:31:43.4318163Z 2025-05-07T20:31:43.4318328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.4318846Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.4319314Z module_map=module_map) 2025-05-07T20:31:43.4319812Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.4320215Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.4320499Z E ^ 2025-05-07T20:31:43.4321048Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.4321597Z 2025-05-07T20:31:43.4322113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.4322734Z 2025-05-07T20:31:43.4322848Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.4323321Z self=, 2025-05-07T20:31:43.4323787Z T=2048, 2025-05-07T20:31:43.4323986Z D=5120, 2025-05-07T20:31:43.4324191Z scale_ub=None, 2025-05-07T20:31:43.4324424Z contiguous=False, 2025-05-07T20:31:43.4324661Z compiled=True, 2025-05-07T20:31:43.4324886Z ) 2025-05-07T20:31:43.5042378Z self = 2025-05-07T20:31:43.5043169Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.5043540Z 2025-05-07T20:31:43.5043656Z @given( 2025-05-07T20:31:43.5043902Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.5044220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.5044526Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.5044856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.5045175Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.5045475Z ) 2025-05-07T20:31:43.5045829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.5046262Z def test_silu_mul_quant( 2025-05-07T20:31:43.5046512Z self, 2025-05-07T20:31:43.5046710Z T: int, 2025-05-07T20:31:43.5046904Z D: int, 2025-05-07T20:31:43.5047131Z scale_ub: Optional[float], 2025-05-07T20:31:43.5047415Z contiguous: bool, 2025-05-07T20:31:43.5047658Z compiled: bool, 2025-05-07T20:31:43.5047890Z ) -> None: 2025-05-07T20:31:43.5048118Z torch.manual_seed(2025) 2025-05-07T20:31:43.5048358Z 2025-05-07T20:31:43.5048640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.5048991Z 2025-05-07T20:31:43.5049197Z x_sign = torch.sign(x) 2025-05-07T20:31:43.5049496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.5049815Z x = x_sign * x_clamp 2025-05-07T20:31:43.5050068Z x0 = x[:, :D] 2025-05-07T20:31:43.5050293Z x1 = x[:, D:] 2025-05-07T20:31:43.5050510Z 2025-05-07T20:31:43.5050710Z if contiguous: 2025-05-07T20:31:43.5050944Z x0 = x0.contiguous() 2025-05-07T20:31:43.5051217Z x1 = x1.contiguous() 2025-05-07T20:31:43.5051465Z 2025-05-07T20:31:43.5051661Z if scale_ub is not None: 2025-05-07T20:31:43.5051942Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.5052283Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.5052587Z ) 2025-05-07T20:31:43.5052788Z else: 2025-05-07T20:31:43.5053372Z scale_ub_tensor = None 2025-05-07T20:31:43.5053628Z 2025-05-07T20:31:43.5053864Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.5054181Z op = silu_mul_quant 2025-05-07T20:31:43.5054435Z if compiled: 2025-05-07T20:31:43.5054678Z op = torch.compile(op) 2025-05-07T20:31:43.5054975Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5055257Z 2025-05-07T20:31:43.5055447Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.5055615Z 2025-05-07T20:31:43.5055716Z moe/activation_test.py:117: 2025-05-07T20:31:43.5056019Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5056350Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.5056821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5057382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.5057935Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.5058604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.5059288Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.5059820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.5060491Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.5061157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.5061696Z kernel = self.compile( 2025-05-07T20:31:43.5062237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.5062890Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.5063291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5063521Z 2025-05-07T20:31:43.5063733Z self = 2025-05-07T20:31:43.5064848Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.5066243Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871f58d60>} 2025-05-07T20:31:43.5067584Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.5068612Z context = 2025-05-07T20:31:43.5068899Z 2025-05-07T20:31:43.5069080Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.5069695Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.5070172Z module_map=module_map) 2025-05-07T20:31:43.5070546Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.5070903Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.5071160Z E ^ 2025-05-07T20:31:43.5071626Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.5072071Z 2025-05-07T20:31:43.5072494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.5073004Z 2025-05-07T20:31:43.5073108Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.5073529Z self=, 2025-05-07T20:31:43.5074022Z T=2048, 2025-05-07T20:31:43.5074219Z D=5120, 2025-05-07T20:31:43.5074411Z scale_ub=1200.0, 2025-05-07T20:31:43.5074640Z contiguous=False, 2025-05-07T20:31:43.5074869Z compiled=True, 2025-05-07T20:31:43.5075074Z ) 2025-05-07T20:31:43.5075448Z self = 2025-05-07T20:31:43.5075948Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:43.5076218Z 2025-05-07T20:31:43.5076298Z @given( 2025-05-07T20:31:43.5076532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.5076850Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.5077152Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.5077562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.5077893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.5078175Z ) 2025-05-07T20:31:43.5078524Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.5078967Z def test_silu_mul_quant( 2025-05-07T20:31:43.5079212Z self, 2025-05-07T20:31:43.5079404Z T: int, 2025-05-07T20:31:43.5079608Z D: int, 2025-05-07T20:31:43.5079835Z scale_ub: Optional[float], 2025-05-07T20:31:43.5080102Z contiguous: bool, 2025-05-07T20:31:43.5080347Z compiled: bool, 2025-05-07T20:31:43.5080569Z ) -> None: 2025-05-07T20:31:43.5080786Z torch.manual_seed(2025) 2025-05-07T20:31:43.5081029Z 2025-05-07T20:31:43.5081305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.5081647Z 2025-05-07T20:31:43.5081846Z x_sign = torch.sign(x) 2025-05-07T20:31:43.5082146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.5082462Z x = x_sign * x_clamp 2025-05-07T20:31:43.5082701Z x0 = x[:, :D] 2025-05-07T20:31:43.5082923Z x1 = x[:, D:] 2025-05-07T20:31:43.5083139Z 2025-05-07T20:31:43.5083335Z if contiguous: 2025-05-07T20:31:43.5083579Z x0 = x0.contiguous() 2025-05-07T20:31:43.5083843Z x1 = x1.contiguous() 2025-05-07T20:31:43.5084083Z 2025-05-07T20:31:43.5084282Z if scale_ub is not None: 2025-05-07T20:31:43.5084559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.5084893Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.5085211Z ) 2025-05-07T20:31:43.5085416Z else: 2025-05-07T20:31:43.5085627Z scale_ub_tensor = None 2025-05-07T20:31:43.5085888Z 2025-05-07T20:31:43.5086124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.5086438Z op = silu_mul_quant 2025-05-07T20:31:43.5086699Z if compiled: 2025-05-07T20:31:43.5086956Z op = torch.compile(op) 2025-05-07T20:31:43.5087256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5087536Z 2025-05-07T20:31:43.5087742Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.5087906Z 2025-05-07T20:31:43.5088018Z moe/activation_test.py:117: 2025-05-07T20:31:43.5088316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5088663Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.5088958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5089510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.5090070Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.5090726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.5091415Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.5091944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.5092712Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.5093380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.5093911Z kernel = self.compile( 2025-05-07T20:31:43.5094462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.5095167Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.5095571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5095801Z 2025-05-07T20:31:43.5096010Z self = 2025-05-07T20:31:43.5097085Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.5098533Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871f59760>} 2025-05-07T20:31:43.5099866Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.5100879Z context = 2025-05-07T20:31:43.5101163Z 2025-05-07T20:31:43.5101328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.5101846Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.5102321Z module_map=module_map) 2025-05-07T20:31:43.5102680Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.5103033Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.5103302Z E ^ 2025-05-07T20:31:43.5103766Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.5104216Z 2025-05-07T20:31:43.5104631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.5105152Z 2025-05-07T20:31:43.6437102Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.6437792Z self=, 2025-05-07T20:31:43.6438354Z T=4096, 2025-05-07T20:31:43.6438603Z D=5120, 2025-05-07T20:31:43.6438804Z scale_ub=1200.0, 2025-05-07T20:31:43.6439037Z contiguous=True, 2025-05-07T20:31:43.6439263Z compiled=True, 2025-05-07T20:31:43.6439501Z ) 2025-05-07T20:31:43.6439827Z self = 2025-05-07T20:31:43.6440325Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:43.6440592Z 2025-05-07T20:31:43.6440689Z @given( 2025-05-07T20:31:43.6440924Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.6441239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.6441544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.6441873Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.6442209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.6442487Z ) 2025-05-07T20:31:43.6442837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.6443279Z def test_silu_mul_quant( 2025-05-07T20:31:43.6443517Z self, 2025-05-07T20:31:43.6443725Z T: int, 2025-05-07T20:31:43.6443931Z D: int, 2025-05-07T20:31:43.6444154Z scale_ub: Optional[float], 2025-05-07T20:31:43.6444430Z contiguous: bool, 2025-05-07T20:31:43.6444675Z compiled: bool, 2025-05-07T20:31:43.6444904Z ) -> None: 2025-05-07T20:31:43.6445459Z torch.manual_seed(2025) 2025-05-07T20:31:43.6445706Z 2025-05-07T20:31:43.6445983Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.6446322Z 2025-05-07T20:31:43.6446521Z x_sign = torch.sign(x) 2025-05-07T20:31:43.6446812Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.6447117Z x = x_sign * x_clamp 2025-05-07T20:31:43.6447363Z x0 = x[:, :D] 2025-05-07T20:31:43.6447589Z x1 = x[:, D:] 2025-05-07T20:31:43.6447794Z 2025-05-07T20:31:43.6447987Z if contiguous: 2025-05-07T20:31:43.6448224Z x0 = x0.contiguous() 2025-05-07T20:31:43.6448479Z x1 = x1.contiguous() 2025-05-07T20:31:43.6448724Z 2025-05-07T20:31:43.6449075Z if scale_ub is not None: 2025-05-07T20:31:43.6449342Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.6449679Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.6449991Z ) 2025-05-07T20:31:43.6450198Z else: 2025-05-07T20:31:43.6450410Z scale_ub_tensor = None 2025-05-07T20:31:43.6450664Z 2025-05-07T20:31:43.6450901Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.6451211Z op = silu_mul_quant 2025-05-07T20:31:43.6451467Z if compiled: 2025-05-07T20:31:43.6451722Z op = torch.compile(op) 2025-05-07T20:31:43.6452018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6452296Z 2025-05-07T20:31:43.6452495Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.6452658Z 2025-05-07T20:31:43.6452758Z moe/activation_test.py:117: 2025-05-07T20:31:43.6453056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6453397Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.6453681Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6454236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.6454795Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.6455454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.6456155Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.6456680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.6457358Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.6458024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.6458556Z kernel = self.compile( 2025-05-07T20:31:43.6459098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.6459761Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.6460165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6460397Z 2025-05-07T20:31:43.6460608Z self = 2025-05-07T20:31:43.6470267Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.6471672Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871f5a980>} 2025-05-07T20:31:43.6473027Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.6474063Z context = 2025-05-07T20:31:43.6474477Z 2025-05-07T20:31:43.6474676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.6475222Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.6475690Z module_map=module_map) 2025-05-07T20:31:43.6476060Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.6476413Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.6476678Z E ^ 2025-05-07T20:31:43.6477150Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.6477601Z 2025-05-07T20:31:43.6478021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.6478741Z 2025-05-07T20:31:43.6478852Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.6479334Z self=, 2025-05-07T20:31:43.6479801Z T=128, 2025-05-07T20:31:43.6479998Z D=5120, 2025-05-07T20:31:43.6480208Z scale_ub=1200.0, 2025-05-07T20:31:43.6480456Z contiguous=False, 2025-05-07T20:31:43.6480694Z compiled=True, 2025-05-07T20:31:43.6480929Z ) 2025-05-07T20:31:43.7307938Z self = 2025-05-07T20:31:43.7308703Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:43.7309062Z 2025-05-07T20:31:43.7309237Z @given( 2025-05-07T20:31:43.7309473Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.7309790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.7310125Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.7310462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.7310790Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.7311079Z ) 2025-05-07T20:31:43.7311453Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.7311893Z def test_silu_mul_quant( 2025-05-07T20:31:43.7312146Z self, 2025-05-07T20:31:43.7312353Z T: int, 2025-05-07T20:31:43.7312552Z D: int, 2025-05-07T20:31:43.7312779Z scale_ub: Optional[float], 2025-05-07T20:31:43.7313057Z contiguous: bool, 2025-05-07T20:31:43.7313295Z compiled: bool, 2025-05-07T20:31:43.7313529Z ) -> None: 2025-05-07T20:31:43.7313757Z torch.manual_seed(2025) 2025-05-07T20:31:43.7314000Z 2025-05-07T20:31:43.7314281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.7314628Z 2025-05-07T20:31:43.7314843Z x_sign = torch.sign(x) 2025-05-07T20:31:43.7315139Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.7315459Z x = x_sign * x_clamp 2025-05-07T20:31:43.7315714Z x0 = x[:, :D] 2025-05-07T20:31:43.7315937Z x1 = x[:, D:] 2025-05-07T20:31:43.7316159Z 2025-05-07T20:31:43.7316358Z if contiguous: 2025-05-07T20:31:43.7316596Z x0 = x0.contiguous() 2025-05-07T20:31:43.7316864Z x1 = x1.contiguous() 2025-05-07T20:31:43.7317114Z 2025-05-07T20:31:43.7317310Z if scale_ub is not None: 2025-05-07T20:31:43.7317595Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.7317941Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.7318248Z ) 2025-05-07T20:31:43.7318459Z else: 2025-05-07T20:31:43.7318681Z scale_ub_tensor = None 2025-05-07T20:31:43.7318936Z 2025-05-07T20:31:43.7319179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.7319511Z op = silu_mul_quant 2025-05-07T20:31:43.7319775Z if compiled: 2025-05-07T20:31:43.7320027Z op = torch.compile(op) 2025-05-07T20:31:43.7320337Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.7320974Z 2025-05-07T20:31:43.7321172Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.7321348Z 2025-05-07T20:31:43.7321455Z moe/activation_test.py:117: 2025-05-07T20:31:43.7321761Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.7322096Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.7322386Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.7322953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.7323519Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.7324174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.7325067Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.7325609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.7326294Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.7326958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.7327495Z kernel = self.compile( 2025-05-07T20:31:43.7328041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.7329064Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.7329468Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.7329699Z 2025-05-07T20:31:43.7329915Z self = 2025-05-07T20:31:43.7330992Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.7332374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871db0720>} 2025-05-07T20:31:43.7333710Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.7334725Z context = 2025-05-07T20:31:43.7335058Z 2025-05-07T20:31:43.7335234Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.7335743Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.7336213Z module_map=module_map) 2025-05-07T20:31:43.7336578Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.7336933Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.7337187Z E ^ 2025-05-07T20:31:43.7337655Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

[The next six Hypothesis examples fail with the identical CompilationError and traceback as above; the repeated test source and tracebacks are condensed to one line per example.]

2025-05-07T20:31:43.7339132Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError
2025-05-07T20:31:43.8335974Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError
2025-05-07T20:31:43.8368143Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
2025-05-07T20:31:44.1798487Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError
2025-05-07T20:31:44.1840377Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError
2025-05-07T20:31:44.2611303Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError
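[Annotation] The repeated failure above is an architecture mismatch, not a problem with the sampled inputs: Triton's fp8e4nv type is float8 e4m3, which this Triton build only code-generates for GPUs of compute capability 8.9 or newer (Ada/Hopper). Ampere-class devices report (8, 0) or (8, 6) and only expose fp8e4b15 and fp8e5 (e5m2), so compilation aborts in make_ir before the kernel ever runs. A minimal guard along these lines could skip such examples up front; supports_fp8e4nv and the class name are hypothetical, not part of the test suite:

    # Hedged sketch: gate fp8 e4m3 tests on device capability.
    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's NVIDIA backend accepts fp8e4nv (float8 e4m3) only on
        # compute capability >= (8, 9); Ampere parts report (8, 0)/(8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):  # hypothetical class name
        ...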
2025-05-07T20:31:44.3290454Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:44.3299617Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:44.3300157Z x_sign = torch.sign(x)
2025-05-07T20:31:44.3300451Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:44.3302457Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
2025-05-07T20:31:44.3304430Z See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:44.3304560Z moe/activation_test.py:95: OutOfMemoryError

[Subsequent examples alternate between the same OutOfMemoryError and the same CompilationError; each is condensed to one line, keeping the failing statement and the attempted allocation size.]

2025-05-07T20:31:44.3304877Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(...)); tried to allocate 112.00 MiB
2025-05-07T20:31:44.3318181Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)); tried to allocate 448.00 MiB
2025-05-07T20:31:44.3331567Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(...)); tried to allocate 56.00 MiB
2025-05-07T20:31:44.3344944Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)); tried to allocate 56.00 MiB
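[Annotation] The out-of-memory failures interleaved with the compile errors are consistent with allocations accumulating across Hypothesis examples: each example builds a [T, 2*D] bfloat16 input (for T=16384, D=7168 that is 16384 x 14336 x 2 bytes = 448 MiB, matching the failed allocation above) plus same-sized intermediates for sign/clamp, and by this point roughly 21.6 GiB of the 22.07 GiB device is already held by PyTorch. The error text itself suggests expandable segments; a sketch of that, plus releasing cached blocks between examples (a hypothetical helper, not existing code):

    # PYTORCH_CUDA_ALLOC_CONF must be set before CUDA is initialized, e.g. in
    # the shell that launches the tests, as the error message suggests:
    #
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
    #
    import gc

    import torch

    def free_cuda_memory() -> None:
        # Hypothetical cleanup to run after each Hypothesis example.
        gc.collect()              # drop dead Python references first
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver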
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.4233739Z 2025-05-07T20:31:44.4233866Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:44.4234078Z 2025-05-07T20:31:44.4234181Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4234593Z self=, 2025-05-07T20:31:44.4235054Z T=1, 2025-05-07T20:31:44.4235241Z D=7168, 2025-05-07T20:31:44.4235443Z scale_ub=1200.0, 2025-05-07T20:31:44.4235672Z contiguous=True, 2025-05-07T20:31:44.4235890Z compiled=False, 2025-05-07T20:31:44.4236099Z ) 2025-05-07T20:31:44.4236614Z self = 2025-05-07T20:31:44.4237093Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.4237361Z 2025-05-07T20:31:44.4237443Z @given( 2025-05-07T20:31:44.4237673Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4237983Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4238280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4238609Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4238936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4239210Z ) 2025-05-07T20:31:44.4239559Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4240195Z def test_silu_mul_quant( 2025-05-07T20:31:44.4240434Z self, 2025-05-07T20:31:44.4240629Z T: int, 2025-05-07T20:31:44.4240829Z D: int, 2025-05-07T20:31:44.4241044Z scale_ub: Optional[float], 2025-05-07T20:31:44.4241317Z contiguous: bool, 2025-05-07T20:31:44.4241556Z compiled: bool, 2025-05-07T20:31:44.4241778Z ) -> None: 2025-05-07T20:31:44.4241986Z torch.manual_seed(2025) 2025-05-07T20:31:44.4242224Z 2025-05-07T20:31:44.4242490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4242827Z 2025-05-07T20:31:44.4243020Z x_sign = torch.sign(x) 2025-05-07T20:31:44.4243309Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.4243611Z x = x_sign * x_clamp 2025-05-07T20:31:44.4243853Z x0 = x[:, :D] 2025-05-07T20:31:44.4244074Z x1 = x[:, D:] 2025-05-07T20:31:44.4244275Z 2025-05-07T20:31:44.4244473Z if contiguous: 2025-05-07T20:31:44.4244709Z x0 = x0.contiguous() 2025-05-07T20:31:44.4244961Z x1 = x1.contiguous() 2025-05-07T20:31:44.4245199Z 2025-05-07T20:31:44.4245393Z if scale_ub is not None: 2025-05-07T20:31:44.4245710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.4246058Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.4246391Z ) 2025-05-07T20:31:44.4246592Z else: 2025-05-07T20:31:44.4246801Z scale_ub_tensor = None 2025-05-07T20:31:44.4247049Z 2025-05-07T20:31:44.4247285Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.4247601Z op = silu_mul_quant 2025-05-07T20:31:44.4247852Z if compiled: 2025-05-07T20:31:44.4248100Z op = torch.compile(op) 2025-05-07T20:31:44.4248393Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.4248668Z 2025-05-07T20:31:44.4248857Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.4249025Z 2025-05-07T20:31:44.4249125Z moe/activation_test.py:117: 2025-05-07T20:31:44.4249419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.4249751Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.4250033Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.4250718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.4251406Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.4251943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.4252637Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.4253290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.4253816Z kernel = self.compile( 2025-05-07T20:31:44.4254363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.4255007Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.4255491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.4255727Z 2025-05-07T20:31:44.4255932Z self = 2025-05-07T20:31:44.4257006Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.4258353Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f687191dbc0>} 2025-05-07T20:31:44.4259684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.4260775Z context = 2025-05-07T20:31:44.4261062Z 2025-05-07T20:31:44.4261231Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.4261744Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.4262202Z module_map=module_map) 2025-05-07T20:31:44.4262561Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.4262910Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.4263168Z E ^ 2025-05-07T20:31:44.4263634Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.4264080Z 2025-05-07T20:31:44.4264502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.4265038Z 2025-05-07T20:31:44.4265174Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4265580Z self=, 2025-05-07T20:31:44.4265981Z T=128, 2025-05-07T20:31:44.4266172Z D=5120, 2025-05-07T20:31:44.4266360Z scale_ub=None, 2025-05-07T20:31:44.4266575Z contiguous=True, 2025-05-07T20:31:44.4266801Z compiled=False, 2025-05-07T20:31:44.4266998Z ) 2025-05-07T20:31:44.4805119Z self = 2025-05-07T20:31:44.4805920Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.4806306Z 2025-05-07T20:31:44.4806484Z @given( 2025-05-07T20:31:44.4806802Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4807234Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4807633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4807962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4808306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4808592Z ) 2025-05-07T20:31:44.4808944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4809388Z def test_silu_mul_quant( 2025-05-07T20:31:44.4809623Z self, 2025-05-07T20:31:44.4809818Z T: int, 2025-05-07T20:31:44.4810015Z D: int, 2025-05-07T20:31:44.4810228Z scale_ub: Optional[float], 2025-05-07T20:31:44.4810498Z contiguous: bool, 2025-05-07T20:31:44.4810736Z compiled: bool, 2025-05-07T20:31:44.4810951Z ) -> None: 2025-05-07T20:31:44.4811167Z torch.manual_seed(2025) 2025-05-07T20:31:44.4811406Z 2025-05-07T20:31:44.4811676Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4812016Z 2025-05-07T20:31:44.4812214Z x_sign = torch.sign(x) 2025-05-07T20:31:44.4812497Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.4812802Z x = x_sign * x_clamp 2025-05-07T20:31:44.4813042Z x0 = x[:, :D] 2025-05-07T20:31:44.4813256Z x1 = x[:, D:] 2025-05-07T20:31:44.4813623Z 2025-05-07T20:31:44.4813819Z if contiguous: 2025-05-07T20:31:44.4814044Z x0 = x0.contiguous() 2025-05-07T20:31:44.4814307Z x1 = x1.contiguous() 2025-05-07T20:31:44.4814545Z 2025-05-07T20:31:44.4814731Z if scale_ub is not None: 2025-05-07T20:31:44.4815001Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.4815337Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.4815645Z ) 2025-05-07T20:31:44.4815836Z else: 2025-05-07T20:31:44.4816050Z scale_ub_tensor = None 2025-05-07T20:31:44.4816295Z 2025-05-07T20:31:44.4816518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.4816971Z op = silu_mul_quant 2025-05-07T20:31:44.4817221Z if compiled: 2025-05-07T20:31:44.4817464Z op = torch.compile(op) 2025-05-07T20:31:44.4817758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.4818037Z 2025-05-07T20:31:44.4818227Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.4818396Z 2025-05-07T20:31:44.4818495Z moe/activation_test.py:117: 2025-05-07T20:31:44.4818786Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.4819116Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.4819390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.4820071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.4820755Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.4821286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.4821972Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.4822626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.4823180Z kernel = self.compile( 2025-05-07T20:31:44.4823719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.4824364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.4824762Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.4825007Z 2025-05-07T20:31:44.4825256Z self = 2025-05-07T20:31:44.4826334Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.4827693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f687191ed40>} 2025-05-07T20:31:44.4829249Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.4830265Z context = 2025-05-07T20:31:44.4830549Z 2025-05-07T20:31:44.4830722Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.4831228Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.4831691Z module_map=module_map) 2025-05-07T20:31:44.4832050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.4832407Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.4832658Z E ^ 2025-05-07T20:31:44.4833118Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.4833683Z 2025-05-07T20:31:44.4834103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.4834607Z 2025-05-07T20:31:44.4834709Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4835115Z self=, 2025-05-07T20:31:44.4835565Z T=128, 2025-05-07T20:31:44.4835758Z D=7168, 2025-05-07T20:31:44.4835952Z scale_ub=None, 2025-05-07T20:31:44.4836168Z contiguous=True, 2025-05-07T20:31:44.4836387Z compiled=False, 2025-05-07T20:31:44.4836587Z ) 2025-05-07T20:31:44.4836905Z self = 2025-05-07T20:31:44.4837503Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.4837766Z 2025-05-07T20:31:44.4837844Z @given( 2025-05-07T20:31:44.4838077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4838392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4838696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4839021Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4839348Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4839627Z ) 2025-05-07T20:31:44.4839966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4840404Z def test_silu_mul_quant( 2025-05-07T20:31:44.4840649Z self, 2025-05-07T20:31:44.4840843Z T: int, 2025-05-07T20:31:44.4841039Z D: int, 2025-05-07T20:31:44.4841259Z scale_ub: Optional[float], 2025-05-07T20:31:44.4841523Z contiguous: bool, 2025-05-07T20:31:44.4841763Z compiled: bool, 2025-05-07T20:31:44.4841981Z ) -> None: 2025-05-07T20:31:44.4842194Z torch.manual_seed(2025) 2025-05-07T20:31:44.4842435Z 2025-05-07T20:31:44.4842715Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4843061Z 2025-05-07T20:31:44.4843256Z x_sign = torch.sign(x) 2025-05-07T20:31:44.4843548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.4843854Z x = x_sign * x_clamp 2025-05-07T20:31:44.4844093Z x0 = x[:, :D] 2025-05-07T20:31:44.4844307Z x1 = x[:, D:] 2025-05-07T20:31:44.4844517Z 2025-05-07T20:31:44.4844697Z if contiguous: 2025-05-07T20:31:44.4844927Z x0 = x0.contiguous() 2025-05-07T20:31:44.4845194Z x1 = x1.contiguous() 2025-05-07T20:31:44.4845429Z 2025-05-07T20:31:44.4845623Z if scale_ub is not None: 2025-05-07T20:31:44.4845891Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.4846224Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.4846528Z ) 2025-05-07T20:31:44.4846719Z else: 2025-05-07T20:31:44.4846926Z scale_ub_tensor = None 2025-05-07T20:31:44.4847174Z 2025-05-07T20:31:44.4847406Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.4847716Z op = silu_mul_quant 2025-05-07T20:31:44.4847965Z if compiled: 2025-05-07T20:31:44.4848215Z op = torch.compile(op) 2025-05-07T20:31:44.4848509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.4848795Z 2025-05-07T20:31:44.4848991Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.4849158Z 2025-05-07T20:31:44.4849266Z moe/activation_test.py:117: 2025-05-07T20:31:44.4849553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.4849887Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.4850166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.4850848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.4851528Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.4852152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.4852826Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.4853475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.4854001Z kernel = self.compile( 2025-05-07T20:31:44.4854536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.4855230Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.4855627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.4855938Z 2025-05-07T20:31:44.4856140Z self = 2025-05-07T20:31:44.4857214Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.4858568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f687191fd80>} 2025-05-07T20:31:44.4859892Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.4860906Z context = 2025-05-07T20:31:44.4861194Z 2025-05-07T20:31:44.4861369Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.4861879Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.4862344Z module_map=module_map) 2025-05-07T20:31:44.4862709Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.4863055Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.4863304Z E ^ 2025-05-07T20:31:44.4863762Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.4864210Z 2025-05-07T20:31:44.4864626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.4865133Z 2025-05-07T20:31:44.4865243Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4865653Z self=, 2025-05-07T20:31:44.4866103Z T=2048, 2025-05-07T20:31:44.4866287Z D=7168, 2025-05-07T20:31:44.4866473Z scale_ub=1200.0, 2025-05-07T20:31:44.4866697Z contiguous=True, 2025-05-07T20:31:44.4866916Z compiled=False, 2025-05-07T20:31:44.4867117Z ) 2025-05-07T20:31:44.5533066Z self = 2025-05-07T20:31:44.5533756Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.5534104Z 2025-05-07T20:31:44.5534222Z @given( 2025-05-07T20:31:44.5534493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5534908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5535219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5535546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5535862Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5536143Z ) 2025-05-07T20:31:44.5536486Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5536922Z def test_silu_mul_quant( 2025-05-07T20:31:44.5537161Z self, 2025-05-07T20:31:44.5537355Z T: int, 2025-05-07T20:31:44.5537552Z D: int, 2025-05-07T20:31:44.5537920Z scale_ub: Optional[float], 2025-05-07T20:31:44.5538191Z contiguous: bool, 2025-05-07T20:31:44.5538426Z compiled: bool, 2025-05-07T20:31:44.5538643Z ) -> None: 2025-05-07T20:31:44.5538855Z torch.manual_seed(2025) 2025-05-07T20:31:44.5539090Z 2025-05-07T20:31:44.5539359Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5541400Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
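Two distinct failures are interleaved above: a Triton CompilationError on fp8e4nv and CUDA out-of-memory errors. The CompilationError is an architecture limit rather than a kernel bug: Triton only lowers the fp8e4nv (e4m3) type on NVIDIA GPUs of compute capability 8.9 or newer, and the supported-dtype list it prints here, ('fp8e4b15', 'fp8e5'), is exactly what it offers below sm_89. A minimal skip guard along the following lines would keep the kernel from being compiled on such devices; this is a sketch, and the helper name is illustrative, not FBGEMM code:

    import unittest

    import torch

    def device_supports_fp8e4nv() -> bool:
        # fp8e4nv / e4m3 lowering requires compute capability 8.9+ (Ada, Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(device_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class ActivationTests(unittest.TestCase):
        ...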
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.5543362Z 2025-05-07T20:31:44.5543493Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.5543701Z 2025-05-07T20:31:44.5543808Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5544219Z self=, 2025-05-07T20:31:44.5544611Z T=1, 2025-05-07T20:31:44.5544796Z D=5120, 2025-05-07T20:31:44.5544987Z scale_ub=1200.0, 2025-05-07T20:31:44.5545233Z contiguous=True, 2025-05-07T20:31:44.5545478Z compiled=False, 2025-05-07T20:31:44.5545687Z ) 2025-05-07T20:31:44.5546004Z self = 2025-05-07T20:31:44.5546479Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.5546750Z 2025-05-07T20:31:44.5546829Z @given( 2025-05-07T20:31:44.5547057Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5547366Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5547666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5547998Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5548331Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5548608Z ) 2025-05-07T20:31:44.5548955Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5549467Z def test_silu_mul_quant( 2025-05-07T20:31:44.5549702Z self, 2025-05-07T20:31:44.5549901Z T: int, 2025-05-07T20:31:44.5550096Z D: int, 2025-05-07T20:31:44.5550307Z scale_ub: Optional[float], 2025-05-07T20:31:44.5550575Z contiguous: bool, 2025-05-07T20:31:44.5550815Z compiled: bool, 2025-05-07T20:31:44.5551034Z ) -> None: 2025-05-07T20:31:44.5551244Z torch.manual_seed(2025) 2025-05-07T20:31:44.5551489Z 2025-05-07T20:31:44.5551760Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5552095Z 2025-05-07T20:31:44.5552286Z x_sign = torch.sign(x) 2025-05-07T20:31:44.5552577Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.5552881Z x = x_sign * x_clamp 2025-05-07T20:31:44.5553118Z x0 = x[:, :D] 2025-05-07T20:31:44.5553338Z x1 = x[:, D:] 2025-05-07T20:31:44.5553540Z 2025-05-07T20:31:44.5553722Z if contiguous: 2025-05-07T20:31:44.5553955Z x0 = x0.contiguous() 2025-05-07T20:31:44.5554205Z x1 = x1.contiguous() 2025-05-07T20:31:44.5554439Z 2025-05-07T20:31:44.5554631Z if scale_ub is not None: 2025-05-07T20:31:44.5554894Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.5555260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.5555581Z ) 2025-05-07T20:31:44.5555774Z else: 2025-05-07T20:31:44.5555982Z scale_ub_tensor = None 2025-05-07T20:31:44.5556225Z 2025-05-07T20:31:44.5556451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.5556755Z op = silu_mul_quant 2025-05-07T20:31:44.5557091Z if compiled: 2025-05-07T20:31:44.5557347Z op = torch.compile(op) 2025-05-07T20:31:44.5557637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5557911Z 2025-05-07T20:31:44.5558105Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.5558266Z 2025-05-07T20:31:44.5558361Z moe/activation_test.py:117: 2025-05-07T20:31:44.5558651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5558980Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.5559252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5559939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.5560724Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.5561251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.5561925Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.5562578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.5563104Z kernel = self.compile( 2025-05-07T20:31:44.5563644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.5564288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.5564695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5564919Z 2025-05-07T20:31:44.5565152Z self = 2025-05-07T20:31:44.5566227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.5567575Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871b1d3a0>} 2025-05-07T20:31:44.5568901Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.5569912Z context = 2025-05-07T20:31:44.5570195Z 2025-05-07T20:31:44.5570364Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.5570882Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.5571347Z module_map=module_map) 2025-05-07T20:31:44.5571702Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.5572059Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.5572319Z E ^ 2025-05-07T20:31:44.5572781Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.5573240Z 2025-05-07T20:31:44.5581364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.5581928Z 2025-05-07T20:31:44.5582039Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5582454Z self=, 2025-05-07T20:31:44.5582846Z T=2048, 2025-05-07T20:31:44.5583035Z D=5120, 2025-05-07T20:31:44.5583222Z scale_ub=None, 2025-05-07T20:31:44.5583438Z contiguous=True, 2025-05-07T20:31:44.5583661Z compiled=False, 2025-05-07T20:31:44.5583859Z ) 2025-05-07T20:31:44.5584174Z self = 2025-05-07T20:31:44.5584768Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.5585045Z 2025-05-07T20:31:44.5585139Z @given( 2025-05-07T20:31:44.5585401Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5585713Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5586018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5586345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5586663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5586947Z ) 2025-05-07T20:31:44.5587296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5587730Z def test_silu_mul_quant( 2025-05-07T20:31:44.5587969Z self, 2025-05-07T20:31:44.5588243Z T: int, 2025-05-07T20:31:44.5588430Z D: int, 2025-05-07T20:31:44.5588643Z scale_ub: Optional[float], 2025-05-07T20:31:44.5588907Z contiguous: bool, 2025-05-07T20:31:44.5589210Z compiled: bool, 2025-05-07T20:31:44.5589435Z ) -> None: 2025-05-07T20:31:44.5589650Z torch.manual_seed(2025) 2025-05-07T20:31:44.5589885Z 2025-05-07T20:31:44.5590146Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5590485Z 2025-05-07T20:31:44.5590672Z > x_sign = torch.sign(x) 2025-05-07T20:31:44.5592624Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
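Most of the remaining examples die on their very first CUDA allocation, even small ones (40-448 MiB), because the process already holds roughly 22 GiB. The error text itself suggests the expandable_segments allocator mode; note that it only helps fragmentation and must be configured before the first CUDA allocation in the process, for example as sketched below (exporting the variable in the job environment works equally well):

    import os

    # Must run before CUDA is initialized, i.e. before the first allocation.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var on purpose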
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError

Nine further examples fail identically on the first CUDA allocation in the test body (moe/activation_test.py:92), with the same allocator state each time: GPU 0 total capacity 22.07 GiB, 30.44 MiB free, 21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated. Only the parameters and requested sizes vary:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False): tried to allocate 320.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False): tried to allocate 80.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False): tried to allocate 40.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True): tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): tried to allocate 40.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False): tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True): tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False): tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False): tried to allocate 448.00 MiB

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.7330969Z 2025-05-07T20:31:44.7331090Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.7331306Z 2025-05-07T20:31:44.7331407Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.7331812Z self=, 2025-05-07T20:31:44.7332203Z T=128, 2025-05-07T20:31:44.7332392Z D=5120, 2025-05-07T20:31:44.7332579Z scale_ub=1200.0, 2025-05-07T20:31:44.7332814Z contiguous=False, 2025-05-07T20:31:44.7333036Z compiled=False, 2025-05-07T20:31:44.7333237Z ) 2025-05-07T20:31:44.8354552Z self = 2025-05-07T20:31:44.8355617Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:44.8356008Z 2025-05-07T20:31:44.8356120Z @given( 2025-05-07T20:31:44.8356416Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.8356734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.8357041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.8357373Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.8357698Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.8357985Z ) 2025-05-07T20:31:44.8358344Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.8365933Z def test_silu_mul_quant( 2025-05-07T20:31:44.8366361Z self, 2025-05-07T20:31:44.8366554Z T: int, 2025-05-07T20:31:44.8366747Z D: int, 2025-05-07T20:31:44.8366967Z scale_ub: Optional[float], 2025-05-07T20:31:44.8367235Z contiguous: bool, 2025-05-07T20:31:44.8367472Z compiled: bool, 2025-05-07T20:31:44.8367703Z ) -> None: 2025-05-07T20:31:44.8367913Z torch.manual_seed(2025) 2025-05-07T20:31:44.8368155Z 2025-05-07T20:31:44.8368422Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.8368754Z 2025-05-07T20:31:44.8368944Z x_sign = torch.sign(x) 2025-05-07T20:31:44.8369233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.8369539Z x = x_sign * x_clamp 2025-05-07T20:31:44.8369779Z x0 = x[:, :D] 2025-05-07T20:31:44.8369991Z x1 = x[:, D:] 2025-05-07T20:31:44.8370191Z 2025-05-07T20:31:44.8370379Z if contiguous: 2025-05-07T20:31:44.8370609Z x0 = x0.contiguous() 2025-05-07T20:31:44.8370865Z x1 = x1.contiguous() 2025-05-07T20:31:44.8371101Z 2025-05-07T20:31:44.8371293Z if scale_ub is not None: 2025-05-07T20:31:44.8371554Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.8371890Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.8372194Z ) 2025-05-07T20:31:44.8372395Z else: 2025-05-07T20:31:44.8372598Z scale_ub_tensor = None 2025-05-07T20:31:44.8372843Z 2025-05-07T20:31:44.8373070Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.8373378Z op = silu_mul_quant 2025-05-07T20:31:44.8373627Z if compiled: 2025-05-07T20:31:44.8373875Z op = torch.compile(op) 2025-05-07T20:31:44.8374164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.8374439Z 2025-05-07T20:31:44.8374635Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.8374799Z 2025-05-07T20:31:44.8374899Z moe/activation_test.py:117: 2025-05-07T20:31:44.8375243Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.8375574Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.8375850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.8376536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.8377220Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.8377758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.8378443Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.8379105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.8379632Z kernel = self.compile( 2025-05-07T20:31:44.8380173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.8380824Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.8381220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.8381444Z 2025-05-07T20:31:44.8381739Z self = 2025-05-07T20:31:44.8382834Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.8384190Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68718c40e0>} 2025-05-07T20:31:44.8385567Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.8386660Z context = 2025-05-07T20:31:44.8386948Z 2025-05-07T20:31:44.8387120Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.8387638Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.8388094Z module_map=module_map) 2025-05-07T20:31:44.8388455Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.8388803Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.8389051Z E ^ 2025-05-07T20:31:44.8389567Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.8390020Z 2025-05-07T20:31:44.8390431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.8390941Z 2025-05-07T20:31:44.8391048Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.8391452Z self=, 2025-05-07T20:31:44.8391854Z T=2048, 2025-05-07T20:31:44.8392050Z D=7168, 2025-05-07T20:31:44.8392231Z scale_ub=None, 2025-05-07T20:31:44.8392452Z contiguous=False, 2025-05-07T20:31:44.8392670Z compiled=False, 2025-05-07T20:31:44.8392868Z ) 2025-05-07T20:31:44.8393176Z self = 2025-05-07T20:31:44.8393667Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:44.8393934Z 2025-05-07T20:31:44.8394015Z @given( 2025-05-07T20:31:44.8394237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.8394541Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.8394842Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.8395163Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.8395490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.8395768Z ) 2025-05-07T20:31:44.8396109Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.8396546Z def test_silu_mul_quant( 2025-05-07T20:31:44.8396783Z self, 2025-05-07T20:31:44.8396975Z T: int, 2025-05-07T20:31:44.8397162Z D: int, 2025-05-07T20:31:44.8397374Z scale_ub: Optional[float], 2025-05-07T20:31:44.8397638Z contiguous: bool, 2025-05-07T20:31:44.8397873Z compiled: bool, 2025-05-07T20:31:44.8398087Z ) -> None: 2025-05-07T20:31:44.8398297Z torch.manual_seed(2025) 2025-05-07T20:31:44.8398529Z 2025-05-07T20:31:44.8398795Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.8400935Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
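The allocator statistics point away from fragmentation, though: nearly the whole 22.07 GiB is live PyTorch allocations, so tensors from earlier failed examples are evidently still referenced. A cleanup called at the top of the test body would run once per Hypothesis example (unittest's tearDown would not, since Hypothesis draws many examples within a single method call); whether this alone recovers the memory here is an assumption:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling references from a previous failed example, then return
        # cached blocks to the driver so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()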
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.8402785Z 2025-05-07T20:31:44.8402910Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.8403122Z 2025-05-07T20:31:44.8403230Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.8403637Z self=, 2025-05-07T20:31:44.8404037Z T=128, 2025-05-07T20:31:44.8404219Z D=7168, 2025-05-07T20:31:44.8404403Z scale_ub=1200.0, 2025-05-07T20:31:44.8404623Z contiguous=True, 2025-05-07T20:31:44.8404837Z compiled=True, 2025-05-07T20:31:44.8405040Z ) 2025-05-07T20:31:44.8701024Z self = 2025-05-07T20:31:44.8701839Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.8702215Z 2025-05-07T20:31:44.8702323Z @given( 2025-05-07T20:31:44.8702661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.8702994Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.8703304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.8703632Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.8703963Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.8704245Z ) 2025-05-07T20:31:44.8704592Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.8705040Z def test_silu_mul_quant( 2025-05-07T20:31:44.8705279Z self, 2025-05-07T20:31:44.8705505Z T: int, 2025-05-07T20:31:44.8705725Z D: int, 2025-05-07T20:31:44.8705947Z scale_ub: Optional[float], 2025-05-07T20:31:44.8706224Z contiguous: bool, 2025-05-07T20:31:44.8706462Z compiled: bool, 2025-05-07T20:31:44.8706681Z ) -> None: 2025-05-07T20:31:44.8706898Z torch.manual_seed(2025) 2025-05-07T20:31:44.8707142Z 2025-05-07T20:31:44.8707414Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.8707748Z 2025-05-07T20:31:44.8707939Z x_sign = torch.sign(x) 2025-05-07T20:31:44.8708234Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.8708537Z x = x_sign * x_clamp 2025-05-07T20:31:44.8708776Z x0 = x[:, :D] 2025-05-07T20:31:44.8708993Z x1 = x[:, D:] 2025-05-07T20:31:44.8709257Z 2025-05-07T20:31:44.8709448Z if contiguous: 2025-05-07T20:31:44.8709679Z x0 = x0.contiguous() 2025-05-07T20:31:44.8709936Z x1 = x1.contiguous() 2025-05-07T20:31:44.8710175Z 2025-05-07T20:31:44.8710364Z if scale_ub is not None: 2025-05-07T20:31:44.8710634Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.8710969Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.8711276Z ) 2025-05-07T20:31:44.8711469Z else: 2025-05-07T20:31:44.8711678Z scale_ub_tensor = None 2025-05-07T20:31:44.8711929Z 2025-05-07T20:31:44.8712159Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.8712467Z op = silu_mul_quant 2025-05-07T20:31:44.8712718Z if compiled: 2025-05-07T20:31:44.8712968Z op = torch.compile(op) 2025-05-07T20:31:44.8713259Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.8713532Z 2025-05-07T20:31:44.8713726Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.8713887Z 2025-05-07T20:31:44.8713988Z moe/activation_test.py:117: 2025-05-07T20:31:44.8714277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.8714609Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.8714888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.8715490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.8716216Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.8716877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.8717552Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.8718079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.8718751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.8719403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.8719927Z kernel = self.compile( 2025-05-07T20:31:44.8720463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.8721234Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.8721634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.8721866Z 2025-05-07T20:31:44.8722069Z self = 2025-05-07T20:31:44.8723141Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.8724493Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68718c6fc0>} 2025-05-07T20:31:44.8725820Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.8726835Z context = 2025-05-07T20:31:44.8727125Z 2025-05-07T20:31:44.8727293Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.8727806Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.8728453Z module_map=module_map) 2025-05-07T20:31:44.8728814Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.8729169Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.8729430Z E ^ 2025-05-07T20:31:44.8729887Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.8730333Z 2025-05-07T20:31:44.8730745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.8731258Z 2025-05-07T20:31:44.8731360Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.8731771Z self=, 2025-05-07T20:31:44.8732170Z T=128, 2025-05-07T20:31:44.8732359Z D=7168, 2025-05-07T20:31:44.8732554Z scale_ub=1200.0, 2025-05-07T20:31:44.8732773Z contiguous=True, 2025-05-07T20:31:44.8732995Z compiled=False, 2025-05-07T20:31:44.8733201Z ) 2025-05-07T20:31:44.8733516Z self = 2025-05-07T20:31:44.8734002Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.8734275Z 2025-05-07T20:31:44.8734354Z @given( 2025-05-07T20:31:44.8734585Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.8734893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.8735249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.8735587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.8735908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.8736188Z ) 2025-05-07T20:31:44.8736685Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.8737123Z def test_silu_mul_quant( 2025-05-07T20:31:44.8737363Z self, 2025-05-07T20:31:44.8737557Z T: int, 2025-05-07T20:31:44.8737748Z D: int, 2025-05-07T20:31:44.8737972Z scale_ub: Optional[float], 2025-05-07T20:31:44.8738245Z contiguous: bool, 2025-05-07T20:31:44.8738486Z compiled: bool, 2025-05-07T20:31:44.8738706Z ) -> None: 2025-05-07T20:31:44.8738919Z torch.manual_seed(2025) 2025-05-07T20:31:44.8739162Z 2025-05-07T20:31:44.8739429Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.8739771Z 2025-05-07T20:31:44.8739963Z x_sign = torch.sign(x) 2025-05-07T20:31:44.8740373Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.8742361Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
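For orientation, the op under test fuses a SiLU-gated multiply with row-wise fp8 quantization. An eager sketch of the same math, written against torch.float8_e4m3fn, is given below; it is a plausible reading of the test's reference path, not FBGEMM's exact scaling convention, and scale_ub is assumed to be an optional one-element float32 tensor as in the test:

    import torch
    import torch.nn.functional as F

    def silu_mul_quant_reference(x0, x1, scale_ub=None):
        # silu(x0) * x1, then quantize each row to e4m3 with a per-row scale.
        y = F.silu(x0.float()) * x1.float()
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_amax = y.abs().amax(dim=1, keepdim=True)
        if scale_ub is not None:
            row_amax = torch.minimum(row_amax, scale_ub)
        y_scale = row_amax.clamp(min=1e-12) / fp8_max
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(1)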
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.8744203Z 2025-05-07T20:31:44.8744323Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:44.8744535Z 2025-05-07T20:31:44.8744645Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.8745052Z self=, 2025-05-07T20:31:44.8745453Z T=128, 2025-05-07T20:31:44.8745645Z D=5120, 2025-05-07T20:31:44.8745868Z scale_ub=1200.0, 2025-05-07T20:31:44.8746105Z contiguous=True, 2025-05-07T20:31:44.8746324Z compiled=True, 2025-05-07T20:31:44.8746527Z ) 2025-05-07T20:31:44.8746842Z self = 2025-05-07T20:31:44.8747321Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.8747585Z 2025-05-07T20:31:44.8747673Z @given( 2025-05-07T20:31:44.8747898Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.8748207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.8748508Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.8748831Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.8749215Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.8749499Z ) 2025-05-07T20:31:44.8749844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.8750280Z def test_silu_mul_quant( 2025-05-07T20:31:44.8750518Z self, 2025-05-07T20:31:44.8750727Z T: int, 2025-05-07T20:31:44.8750918Z D: int, 2025-05-07T20:31:44.8751137Z scale_ub: Optional[float], 2025-05-07T20:31:44.8751402Z contiguous: bool, 2025-05-07T20:31:44.8751634Z compiled: bool, 2025-05-07T20:31:44.8751857Z ) -> None: 2025-05-07T20:31:44.8752072Z torch.manual_seed(2025) 2025-05-07T20:31:44.8752310Z 2025-05-07T20:31:44.8752575Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.8752913Z 2025-05-07T20:31:44.8753107Z > x_sign = torch.sign(x) 2025-05-07T20:31:44.8755131Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.8757029Z 2025-05-07T20:31:44.8757149Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:44.8757362Z 2025-05-07T20:31:44.8757463Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.8757872Z self=, 2025-05-07T20:31:44.8758272Z T=128, 2025-05-07T20:31:44.8758453Z D=7168, 2025-05-07T20:31:44.8758643Z scale_ub=None, 2025-05-07T20:31:44.8758856Z contiguous=True, 2025-05-07T20:31:44.8759073Z compiled=True, 2025-05-07T20:31:44.8759274Z ) 2025-05-07T20:31:45.3502168Z self = 2025-05-07T20:31:45.3502868Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.3503446Z 2025-05-07T20:31:45.3503556Z @given( 2025-05-07T20:31:45.3503866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3504186Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3504495Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3504822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3505144Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3505434Z ) 2025-05-07T20:31:45.3505833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3506271Z def test_silu_mul_quant( 2025-05-07T20:31:45.3506508Z self, 2025-05-07T20:31:45.3506705Z T: int, 2025-05-07T20:31:45.3506907Z D: int, 2025-05-07T20:31:45.3507119Z scale_ub: Optional[float], 2025-05-07T20:31:45.3507393Z contiguous: bool, 2025-05-07T20:31:45.3507632Z compiled: bool, 2025-05-07T20:31:45.3507854Z ) -> None: 2025-05-07T20:31:45.3508069Z torch.manual_seed(2025) 2025-05-07T20:31:45.3508308Z 2025-05-07T20:31:45.3508579Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3510683Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.3512513Z 2025-05-07T20:31:45.3512634Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.3512850Z 2025-05-07T20:31:45.3567471Z FAILED 2025-05-07T20:31:45.3567928Z 2025-05-07T20:31:45.3568439Z =================================== FAILURES =================================== 2025-05-07T20:31:45.3569099Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:45.3569763Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:45.3570646Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:31:45.3571425Z | yield 2025-05-07T20:31:45.3572059Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:31:45.3572786Z | self._callTestMethod(testMethod) 2025-05-07T20:31:45.3573573Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:31:45.3574352Z | if method() is not None: 2025-05-07T20:31:45.3574699Z | ^^^^^^^^ 2025-05-07T20:31:45.3575630Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:45.3576664Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3577358Z | ^^^^^^^ 2025-05-07T20:31:45.3578164Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:45.3579049Z | raise the_error_hypothesis_found 2025-05-07T20:31:45.3579646Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:45.3580249Z +-+---------------- 1 ---------------- 2025-05-07T20:31:45.3580657Z | Traceback (most recent call last): 2025-05-07T20:31:45.3581675Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:45.3582781Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3583503Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.3586387Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.3590641Z | Falsifying example: test_silu_mul_quant(
2025-05-07T20:31:45.3591093Z | self=,
2025-05-07T20:31:45.3591504Z | T=128,
2025-05-07T20:31:45.3591708Z | D=7168,
2025-05-07T20:31:45.3591941Z | scale_ub=1200.0,
2025-05-07T20:31:45.3592192Z | contiguous=True,
2025-05-07T20:31:45.3592431Z | compiled=False,
2025-05-07T20:31:45.3592671Z | )
2025-05-07T20:31:45.3592864Z |
2025-05-07T20:31:45.3593752Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case
2025-05-07T20:31:45.3594353Z +---------------- 2 ----------------
2025-05-07T20:31:45.3594652Z | Traceback (most recent call last):
2025-05-07T20:31:45.3595362Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
2025-05-07T20:31:45.3596127Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.3596510Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3598500Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.3600470Z | Falsifying example: test_silu_mul_quant(
2025-05-07T20:31:45.3600912Z | self=,
2025-05-07T20:31:45.3601321Z | T=128,
2025-05-07T20:31:45.3601532Z | D=7168,
2025-05-07T20:31:45.3601753Z | scale_ub=None,
2025-05-07T20:31:45.3601992Z | contiguous=True,
2025-05-07T20:31:45.3602245Z | compiled=True,
2025-05-07T20:31:45.3602498Z | )
2025-05-07T20:31:45.3602764Z |
2025-05-07T20:31:45.3603402Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
2025-05-07T20:31:45.3604018Z +---------------- 3 ----------------
2025-05-07T20:31:45.3604440Z | Traceback (most recent call last):
2025-05-07T20:31:45.3605157Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
2025-05-07T20:31:45.3606029Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.3606409Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3622078Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.3624443Z | Falsifying example: test_silu_mul_quant(
2025-05-07T20:31:45.3625085Z | self=,
2025-05-07T20:31:45.3625729Z | T=128,
2025-05-07T20:31:45.3626030Z | D=5120,
2025-05-07T20:31:45.3626329Z | scale_ub=1200.0,
2025-05-07T20:31:45.3626685Z | contiguous=True,
2025-05-07T20:31:45.3627038Z | compiled=True,
2025-05-07T20:31:45.3627359Z | )
2025-05-07T20:31:45.3627622Z |
2025-05-07T20:31:45.3628658Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
2025-05-07T20:31:45.3629647Z +---------------- 4 ----------------
2025-05-07T20:31:45.3630056Z | Traceback (most recent call last):
2025-05-07T20:31:45.3631080Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
2025-05-07T20:31:45.3632105Z | y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:45.3632508Z | ^^^^^^^^
2025-05-07T20:31:45.3633423Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
2025-05-07T20:31:45.3634427Z | return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:45.3634907Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3636091Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
2025-05-07T20:31:45.3637230Z | _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:45.3638089Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in <lambda>
2025-05-07T20:31:45.3639137Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.3639669Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3640335Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run
2025-05-07T20:31:45.3641140Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:45.3641639Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3642324Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in <dictcomp>
2025-05-07T20:31:45.3643161Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:45.3643662Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3644782Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench
2025-05-07T20:31:45.3645797Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:45.3646359Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3647231Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench
2025-05-07T20:31:45.3648044Z | fn()
2025-05-07T20:31:45.3648874Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
2025-05-07T20:31:45.3650040Z | self.fn.run(
2025-05-07T20:31:45.3650832Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run
2025-05-07T20:31:45.3651487Z | kernel = self.compile(
2025-05-07T20:31:45.3651811Z | ^^^^^^^^^^^^^
2025-05-07T20:31:45.3652665Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile
2025-05-07T20:31:45.3653701Z | module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.3654278Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3655234Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:45.3656378Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:45.3657032Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3657565Z | triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.3658056Z | def _kernel_quantize_fp8_row(
2025-05-07T20:31:45.3658431Z | ^
2025-05-07T20:31:45.3659087Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.3659904Z | Falsifying example: test_silu_mul_quant(
2025-05-07T20:31:45.3660477Z | # The test always failed when commented parts were varied together.
2025-05-07T20:31:45.3661207Z | self=,
2025-05-07T20:31:45.3661836Z | T=1,  # or any other generated value
2025-05-07T20:31:45.3662286Z | D=5120,  # or any other generated value
2025-05-07T20:31:45.3662765Z | scale_ub=None,  # or any other generated value
2025-05-07T20:31:45.3663288Z | contiguous=True,  # or any other generated value
2025-05-07T20:31:45.3663814Z | compiled=True,  # or any other generated value
2025-05-07T20:31:45.3664248Z | )
2025-05-07T20:31:45.3664507Z |
2025-05-07T20:31:45.3665263Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
2025-05-07T20:31:45.3666124Z +------------------------------------
2025-05-07T20:31:45.3666603Z ---------------------------------- Hypothesis ----------------------------------
2025-05-07T20:31:45.3667103Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:45.3667667Z     self=,
2025-05-07T20:31:45.3668177Z     T=1,
2025-05-07T20:31:45.3668428Z     D=5120,
2025-05-07T20:31:45.3668685Z     scale_ub=None,
2025-05-07T20:31:45.3668982Z     contiguous=True,
2025-05-07T20:31:45.3669414Z     compiled=True,
2025-05-07T20:31:45.3669720Z )
2025-05-07T20:31:45.3670203Z self = 
2025-05-07T20:31:45.3670884Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:45.3671261Z 
2025-05-07T20:31:45.3671381Z     @given(
2025-05-07T20:31:45.3671845Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:45.3672267Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:45.3672691Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:45.3673136Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:45.3673568Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:45.3673957Z     )
2025-05-07T20:31:45.3674424Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:45.3675017Z     def test_silu_mul_quant(
2025-05-07T20:31:45.3675337Z         self,
2025-05-07T20:31:45.3675605Z         T: int,
2025-05-07T20:31:45.3675883Z         D: int,
2025-05-07T20:31:45.3676169Z         scale_ub: Optional[float],
2025-05-07T20:31:45.3676637Z         contiguous: bool,
2025-05-07T20:31:45.3676968Z         compiled: bool,
2025-05-07T20:31:45.3677261Z     ) -> None:
2025-05-07T20:31:45.3677560Z         torch.manual_seed(2025)
2025-05-07T20:31:45.3677891Z 
2025-05-07T20:31:45.3678256Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.3678723Z 
2025-05-07T20:31:45.3678995Z         x_sign = torch.sign(x)
2025-05-07T20:31:45.3679383Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.3679795Z         x = x_sign * x_clamp
2025-05-07T20:31:45.3680122Z         x0 = x[:, :D]
2025-05-07T20:31:45.3680439Z         x1 = x[:, D:]
2025-05-07T20:31:45.3680726Z 
2025-05-07T20:31:45.3681000Z         if contiguous:
2025-05-07T20:31:45.3681316Z             x0 = x0.contiguous()
2025-05-07T20:31:45.3681670Z             x1 = x1.contiguous()
2025-05-07T20:31:45.3681990Z 
2025-05-07T20:31:45.3682250Z         if scale_ub is not None:
2025-05-07T20:31:45.3682607Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:45.3683056Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:45.3683465Z             )
2025-05-07T20:31:45.3683723Z         else:
2025-05-07T20:31:45.3683998Z             scale_ub_tensor = None
2025-05-07T20:31:45.3684339Z 
2025-05-07T20:31:45.3684643Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.3685055Z             op = silu_mul_quant
2025-05-07T20:31:45.3685391Z             if compiled:
2025-05-07T20:31:45.3685724Z                 op = torch.compile(op)
2025-05-07T20:31:45.3686109Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.3686475Z 
2025-05-07T20:31:45.3686739Z         y_fp8, y_scale = fn()
2025-05-07T20:31:45.3687110Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:45.3687501Z 
2025-05-07T20:31:45.3687820Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.3688255Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:45.3688654Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:45.3689087Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:45.3689566Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:45.3689973Z 
2025-05-07T20:31:45.3690261Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:45.3690535Z 
2025-05-07T20:31:45.3690683Z moe/activation_test.py:126: 
2025-05-07T20:31:45.3691095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.3691581Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:45.3692051Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:45.3693158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:45.3694220Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:45.3694989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:45.3695956Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.3697014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:45.3698022Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:45.3699086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:45.3700157Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:45.3701175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:45.3702088Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:45.3702932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:45.3703766Z     fn()
2025-05-07T20:31:45.3704469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:45.3705294Z     self.fn.run(
2025-05-07T20:31:45.3705955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:45.3706679Z     kernel = self.compile(
2025-05-07T20:31:45.3707404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:45.3708244Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.3708791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.3709194Z 
2025-05-07T20:31:45.3709455Z self = 
2025-05-07T20:31:45.3710892Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:45.3712815Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a9943240>}
2025-05-07T20:31:45.3714663Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:45.3716112Z context = 
2025-05-07T20:31:45.3716501Z 
2025-05-07T20:31:45.3716722Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:45.3717423Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:45.3718071Z             module_map=module_map)
2025-05-07T20:31:45.3718562Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.3719051Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:45.3719435Z E       ^
2025-05-07T20:31:45.3720097Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.3720732Z 2025-05-07T20:31:45.3721315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.3722043Z 2025-05-07T20:31:45.3722191Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.3722773Z self=, 2025-05-07T20:31:45.3723330Z T=2048, 2025-05-07T20:31:45.3723590Z D=5120, 2025-05-07T20:31:45.3723877Z scale_ub=1200.0, 2025-05-07T20:31:45.3724196Z contiguous=True, 2025-05-07T20:31:45.3724515Z compiled=False, 2025-05-07T20:31:45.3724815Z ) 2025-05-07T20:31:45.3725271Z self = 2025-05-07T20:31:45.3725956Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.3726704Z 2025-05-07T20:31:45.3726816Z @given( 2025-05-07T20:31:45.3727139Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3727566Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3727995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3728727Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3729202Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3729598Z ) 2025-05-07T20:31:45.3730047Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3730625Z def test_silu_mul_quant( 2025-05-07T20:31:45.3730953Z self, 2025-05-07T20:31:45.3731201Z T: int, 2025-05-07T20:31:45.3731661Z D: int, 2025-05-07T20:31:45.3731940Z scale_ub: Optional[float], 2025-05-07T20:31:45.3732311Z contiguous: bool, 2025-05-07T20:31:45.3732644Z compiled: bool, 2025-05-07T20:31:45.3732927Z ) -> None: 2025-05-07T20:31:45.3733208Z torch.manual_seed(2025) 2025-05-07T20:31:45.3733514Z 2025-05-07T20:31:45.3733846Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3734282Z 2025-05-07T20:31:45.3734539Z x_sign = torch.sign(x) 2025-05-07T20:31:45.3734921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.3735391Z x = x_sign * x_clamp 2025-05-07T20:31:45.3735721Z x0 = x[:, :D] 2025-05-07T20:31:45.3736025Z x1 = x[:, D:] 2025-05-07T20:31:45.3736279Z 2025-05-07T20:31:45.3736515Z if contiguous: 2025-05-07T20:31:45.3736808Z x0 = x0.contiguous() 2025-05-07T20:31:45.3737124Z x1 = x1.contiguous() 2025-05-07T20:31:45.3737433Z 2025-05-07T20:31:45.3737670Z if scale_ub is not None: 2025-05-07T20:31:45.3738038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.3738494Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.3738938Z ) 2025-05-07T20:31:45.3739205Z else: 2025-05-07T20:31:45.3739499Z scale_ub_tensor = None 2025-05-07T20:31:45.3739840Z 2025-05-07T20:31:45.3740153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3740597Z op = silu_mul_quant 2025-05-07T20:31:45.3740959Z if compiled: 2025-05-07T20:31:45.3741307Z op = torch.compile(op) 2025-05-07T20:31:45.3741693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3742043Z 2025-05-07T20:31:45.3742284Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.3742519Z 2025-05-07T20:31:45.3742655Z moe/activation_test.py:117: 2025-05-07T20:31:45.3743059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3743523Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.3743913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3744900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.3745914Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.3746632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.3747497Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.3748400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.3749167Z kernel = self.compile( 2025-05-07T20:31:45.3749871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.3750714Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.3751225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3751514Z 2025-05-07T20:31:45.3751968Z self = 2025-05-07T20:31:45.3753520Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.3755479Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a995ade0>} 2025-05-07T20:31:45.3757249Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.3758710Z context = 2025-05-07T20:31:45.3759107Z 2025-05-07T20:31:45.3759349Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.3760066Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.3760727Z module_map=module_map) 2025-05-07T20:31:45.3761230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.3761706Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.3762076Z E ^ 2025-05-07T20:31:45.3762744Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.3763387Z 2025-05-07T20:31:45.3763987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.3764718Z 2025-05-07T20:31:45.3764876Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.3765510Z self=, 2025-05-07T20:31:45.3766081Z T=2048, 2025-05-07T20:31:45.3766339Z D=5120, 2025-05-07T20:31:45.3766611Z scale_ub=1200.0, 2025-05-07T20:31:45.3766932Z contiguous=True, 2025-05-07T20:31:45.3767239Z compiled=True, 2025-05-07T20:31:45.3767532Z ) 2025-05-07T20:31:45.3767982Z self = 2025-05-07T20:31:45.3768656Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.3769040Z 2025-05-07T20:31:45.3769151Z @given( 2025-05-07T20:31:45.3769476Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3769917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3770344Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3770810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3771281Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3771678Z ) 2025-05-07T20:31:45.3772163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3772775Z def test_silu_mul_quant( 2025-05-07T20:31:45.3773119Z self, 2025-05-07T20:31:45.3773396Z T: int, 2025-05-07T20:31:45.3773678Z D: int, 2025-05-07T20:31:45.3773967Z scale_ub: Optional[float], 2025-05-07T20:31:45.3774358Z contiguous: bool, 2025-05-07T20:31:45.3774710Z compiled: bool, 2025-05-07T20:31:45.3775030Z ) -> None: 2025-05-07T20:31:45.3775343Z torch.manual_seed(2025) 2025-05-07T20:31:45.3775707Z 2025-05-07T20:31:45.3776093Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3776560Z 2025-05-07T20:31:45.3776830Z x_sign = torch.sign(x) 2025-05-07T20:31:45.3777233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.3777660Z x = x_sign * x_clamp 2025-05-07T20:31:45.3778001Z x0 = x[:, :D] 2025-05-07T20:31:45.3778303Z x1 = x[:, D:] 2025-05-07T20:31:45.3778595Z 2025-05-07T20:31:45.3778856Z if contiguous: 2025-05-07T20:31:45.3779184Z x0 = x0.contiguous() 2025-05-07T20:31:45.3779645Z x1 = x1.contiguous() 2025-05-07T20:31:45.3779988Z 2025-05-07T20:31:45.3780267Z if scale_ub is not None: 2025-05-07T20:31:45.3780651Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.3781118Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.3781558Z ) 2025-05-07T20:31:45.3781834Z else: 2025-05-07T20:31:45.3782123Z scale_ub_tensor = None 2025-05-07T20:31:45.3782477Z 2025-05-07T20:31:45.3782801Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3783234Z op = silu_mul_quant 2025-05-07T20:31:45.3783592Z if compiled: 2025-05-07T20:31:45.3783940Z op = torch.compile(op) 2025-05-07T20:31:45.3784436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3784822Z 2025-05-07T20:31:45.3785096Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.3785487Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.3785905Z 2025-05-07T20:31:45.3786239Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3786711Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.3787114Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.3787554Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.3788056Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.3788487Z 2025-05-07T20:31:45.3788772Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.3789146Z 2025-05-07T20:31:45.3789301Z moe/activation_test.py:126: 2025-05-07T20:31:45.3789732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3790237Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.3790711Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.3791818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.3792874Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.3793644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.3794613Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.3795575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.3796598Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.3797655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.3798709Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.3799736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.3800633Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.3801437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.3802112Z fn() 2025-05-07T20:31:45.3802787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.3803565Z self.fn.run( 2025-05-07T20:31:45.3804154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.3804833Z kernel = self.compile( 2025-05-07T20:31:45.3805590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.3806452Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.3807802Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3808123Z 2025-05-07T20:31:45.3808392Z self = 2025-05-07T20:31:45.3809861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.3811835Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f68a9ace700>} 2025-05-07T20:31:45.3813740Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.3815388Z context = 2025-05-07T20:31:45.3815793Z 2025-05-07T20:31:45.3816036Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.3816767Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.3817419Z module_map=module_map) 2025-05-07T20:31:45.3817904Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.3818388Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.3818743Z E ^ 2025-05-07T20:31:45.3819358Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.3819975Z 2025-05-07T20:31:45.3820546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.3821274Z 2025-05-07T20:31:45.3821419Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.3821983Z self=, 2025-05-07T20:31:45.3822538Z T=16384, 2025-05-07T20:31:45.3822815Z D=7168, 2025-05-07T20:31:45.3823085Z scale_ub=1200.0, 2025-05-07T20:31:45.3823394Z contiguous=False, 2025-05-07T20:31:45.3823711Z compiled=False, 2025-05-07T20:31:45.3824006Z ) 2025-05-07T20:31:45.3824464Z self = 2025-05-07T20:31:45.3849541Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.3849939Z 2025-05-07T20:31:45.3850049Z @given( 2025-05-07T20:31:45.3850360Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3850789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3851202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3851670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3852117Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3852515Z ) 2025-05-07T20:31:45.3853001Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3853601Z def test_silu_mul_quant( 2025-05-07T20:31:45.3853933Z self, 2025-05-07T20:31:45.3854199Z T: int, 2025-05-07T20:31:45.3854465Z D: int, 2025-05-07T20:31:45.3854766Z scale_ub: Optional[float], 2025-05-07T20:31:45.3855138Z contiguous: bool, 2025-05-07T20:31:45.3855460Z compiled: bool, 2025-05-07T20:31:45.3855773Z ) -> None: 2025-05-07T20:31:45.3856076Z torch.manual_seed(2025) 2025-05-07T20:31:45.3856404Z 2025-05-07T20:31:45.3856782Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3857245Z 2025-05-07T20:31:45.3857505Z x_sign = torch.sign(x) 2025-05-07T20:31:45.3857909Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.3858342Z x = x_sign * x_clamp 2025-05-07T20:31:45.3858668Z x0 = x[:, :D] 2025-05-07T20:31:45.3858947Z x1 = x[:, D:] 2025-05-07T20:31:45.3859209Z 2025-05-07T20:31:45.3859746Z if contiguous: 2025-05-07T20:31:45.3860055Z x0 = x0.contiguous() 2025-05-07T20:31:45.3860396Z x1 = x1.contiguous() 2025-05-07T20:31:45.3860709Z 2025-05-07T20:31:45.3860955Z if scale_ub is not None: 2025-05-07T20:31:45.3861323Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.3861765Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.3862169Z ) 2025-05-07T20:31:45.3862416Z else: 2025-05-07T20:31:45.3862685Z scale_ub_tensor = None 2025-05-07T20:31:45.3863007Z 2025-05-07T20:31:45.3863322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3863748Z op = silu_mul_quant 2025-05-07T20:31:45.3864313Z if compiled: 
2025-05-07T20:31:45.3864649Z op = torch.compile(op) 2025-05-07T20:31:45.3865052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3865480Z 2025-05-07T20:31:45.3865753Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.3865979Z 2025-05-07T20:31:45.3866112Z moe/activation_test.py:117: 2025-05-07T20:31:45.3866503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3866946Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.3867316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3868250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.3869302Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.3870045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.3871006Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.3871920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.3872655Z kernel = self.compile( 2025-05-07T20:31:45.3873419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.3874340Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.3874908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3875230Z 2025-05-07T20:31:45.3875513Z self = 2025-05-07T20:31:45.3877032Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.3878988Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68aabc1760>} 2025-05-07T20:31:45.3880887Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.3882311Z context = 2025-05-07T20:31:45.3882721Z 2025-05-07T20:31:45.3882946Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.3883665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.3884310Z module_map=module_map) 2025-05-07T20:31:45.3884798Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.3885308Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.3885674Z E ^ 2025-05-07T20:31:45.3886307Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.3886952Z 2025-05-07T20:31:45.3887619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.3888302Z 2025-05-07T20:31:45.3888442Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.3889022Z self=, 2025-05-07T20:31:45.3889572Z T=1, 2025-05-07T20:31:45.3889836Z D=7168, 2025-05-07T20:31:45.3890119Z scale_ub=None, 2025-05-07T20:31:45.3890404Z contiguous=True, 2025-05-07T20:31:45.3890696Z compiled=True, 2025-05-07T20:31:45.3890952Z ) 2025-05-07T20:31:45.3891362Z self = 2025-05-07T20:31:45.3891989Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.3892412Z 2025-05-07T20:31:45.3892509Z @given( 2025-05-07T20:31:45.3892799Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3893180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3893571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3893993Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3894442Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3894841Z ) 2025-05-07T20:31:45.3895338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3895953Z def test_silu_mul_quant( 2025-05-07T20:31:45.3896296Z self, 2025-05-07T20:31:45.3896577Z T: int, 2025-05-07T20:31:45.3896854Z D: int, 2025-05-07T20:31:45.3897157Z scale_ub: Optional[float], 2025-05-07T20:31:45.3897541Z contiguous: bool, 2025-05-07T20:31:45.3897873Z compiled: bool, 2025-05-07T20:31:45.3898182Z ) -> None: 2025-05-07T20:31:45.3898485Z torch.manual_seed(2025) 2025-05-07T20:31:45.3898806Z 2025-05-07T20:31:45.3899140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3899608Z 2025-05-07T20:31:45.3899888Z x_sign = torch.sign(x) 2025-05-07T20:31:45.3900275Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.3900702Z x = x_sign * x_clamp 2025-05-07T20:31:45.3901028Z x0 = x[:, :D] 2025-05-07T20:31:45.3901316Z x1 = x[:, D:] 2025-05-07T20:31:45.3901605Z 2025-05-07T20:31:45.3901862Z if contiguous: 2025-05-07T20:31:45.3902173Z x0 = x0.contiguous() 2025-05-07T20:31:45.3902529Z x1 = x1.contiguous() 2025-05-07T20:31:45.3902858Z 2025-05-07T20:31:45.3903118Z if scale_ub is not None: 2025-05-07T20:31:45.3903489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.3903951Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.3904394Z ) 2025-05-07T20:31:45.3904666Z else: 2025-05-07T20:31:45.3904968Z scale_ub_tensor = None 2025-05-07T20:31:45.3905326Z 2025-05-07T20:31:45.3905650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3906096Z op = silu_mul_quant 2025-05-07T20:31:45.3906448Z if compiled: 2025-05-07T20:31:45.3906789Z op = torch.compile(op) 2025-05-07T20:31:45.3907187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3907534Z 2025-05-07T20:31:45.3907731Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.3908012Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.3908306Z 2025-05-07T20:31:45.3908546Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3908874Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.3909264Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.3909587Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.3909937Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.3910250Z 2025-05-07T20:31:45.3910454Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:45.3910754Z 2025-05-07T20:31:45.3910865Z moe/activation_test.py:126: 2025-05-07T20:31:45.3911156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3911493Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.3911816Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.3912602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.3913353Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.3913895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.3914654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.3915338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.3916069Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.3916831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.3917582Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.3918308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.3918951Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.3919553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.3920071Z fn() 2025-05-07T20:31:45.3920580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.3921163Z self.fn.run( 2025-05-07T20:31:45.3921635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.3922162Z kernel = self.compile( 2025-05-07T20:31:45.3922702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.3923355Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.3923748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3923985Z 2025-05-07T20:31:45.3924189Z self = 2025-05-07T20:31:45.3925270Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.3926665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f68aa04cd60>} 2025-05-07T20:31:45.3928013Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.3929346Z context = 2025-05-07T20:31:45.3929639Z 2025-05-07T20:31:45.3929804Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.3930322Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.3930790Z module_map=module_map) 2025-05-07T20:31:45.3931155Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.3931512Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.3931779Z E ^ 2025-05-07T20:31:45.3932419Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.3932878Z 2025-05-07T20:31:45.3933294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.3933806Z 2025-05-07T20:31:45.3933913Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.3934323Z self=, 2025-05-07T20:31:45.3934719Z T=4096, 2025-05-07T20:31:45.3934915Z D=5120, 2025-05-07T20:31:45.3935117Z scale_ub=None, 2025-05-07T20:31:45.3935335Z contiguous=False, 2025-05-07T20:31:45.3935566Z compiled=False, 2025-05-07T20:31:45.3935777Z ) 2025-05-07T20:31:45.3936093Z self = 2025-05-07T20:31:45.3936716Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.3936983Z 2025-05-07T20:31:45.3937067Z @given( 2025-05-07T20:31:45.3937298Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3937611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3937916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3938243Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3938561Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3938843Z ) 2025-05-07T20:31:45.3939189Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3939639Z def test_silu_mul_quant( 2025-05-07T20:31:45.3939875Z self, 2025-05-07T20:31:45.3940072Z T: int, 2025-05-07T20:31:45.3940274Z D: int, 2025-05-07T20:31:45.3940494Z scale_ub: Optional[float], 2025-05-07T20:31:45.3940773Z contiguous: bool, 2025-05-07T20:31:45.3941015Z compiled: bool, 2025-05-07T20:31:45.3941231Z ) -> None: 2025-05-07T20:31:45.3941448Z torch.manual_seed(2025) 2025-05-07T20:31:45.3941692Z 2025-05-07T20:31:45.3941966Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3942306Z 2025-05-07T20:31:45.3942504Z x_sign = torch.sign(x) 2025-05-07T20:31:45.3942790Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.3943101Z x = x_sign * x_clamp 2025-05-07T20:31:45.3943342Z x0 = x[:, :D] 2025-05-07T20:31:45.3943561Z x1 = x[:, D:] 2025-05-07T20:31:45.3943762Z 2025-05-07T20:31:45.3944318Z if contiguous: 2025-05-07T20:31:45.3944609Z x0 = x0.contiguous() 2025-05-07T20:31:45.3944961Z x1 = x1.contiguous() 2025-05-07T20:31:45.3945370Z 2025-05-07T20:31:45.3945624Z if scale_ub is not None: 2025-05-07T20:31:45.3946023Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.3946511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.3946873Z ) 2025-05-07T20:31:45.3947217Z else: 2025-05-07T20:31:45.3947587Z scale_ub_tensor = None 2025-05-07T20:31:45.3947925Z 2025-05-07T20:31:45.3948238Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3948695Z op = silu_mul_quant 2025-05-07T20:31:45.3949039Z if compiled: 
2025-05-07T20:31:45.3949428Z op = torch.compile(op) 2025-05-07T20:31:45.3949874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3950258Z 2025-05-07T20:31:45.3950505Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.3950780Z 2025-05-07T20:31:45.3950925Z moe/activation_test.py:117: 2025-05-07T20:31:45.3951352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3951755Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.3952162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3952961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.3953811Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.3954474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.3955264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.3955995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.3956698Z kernel = self.compile( 2025-05-07T20:31:45.3957291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.3958017Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.3958699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3958956Z 2025-05-07T20:31:45.3959224Z self = 2025-05-07T20:31:45.3960344Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.3961887Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a8d402c0>} 2025-05-07T20:31:45.3963311Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.3964475Z context = 2025-05-07T20:31:45.3964810Z 2025-05-07T20:31:45.3965039Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.3965615Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.3966265Z module_map=module_map) 2025-05-07T20:31:45.3966733Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.3967143Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.3967554Z E ^ 2025-05-07T20:31:45.3968150Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.3968624Z 2025-05-07T20:31:45.3969131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.3969654Z 2025-05-07T20:31:45.3969843Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.3970361Z self=, 2025-05-07T20:31:45.3970878Z T=4096, 2025-05-07T20:31:45.3971213Z D=7168, 2025-05-07T20:31:45.3971463Z scale_ub=None, 2025-05-07T20:31:45.3971789Z contiguous=False, 2025-05-07T20:31:45.3972162Z compiled=False, 2025-05-07T20:31:45.3972425Z ) 2025-05-07T20:31:45.3972852Z self = 2025-05-07T20:31:45.3973510Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.3973811Z 2025-05-07T20:31:45.3973974Z @given( 2025-05-07T20:31:45.3974257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3974712Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3975125Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3975507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3976006Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3976402Z ) 2025-05-07T20:31:45.3976809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3977386Z def test_silu_mul_quant( 2025-05-07T20:31:45.3977739Z self, 2025-05-07T20:31:45.3978007Z T: int, 2025-05-07T20:31:45.3978459Z D: int, 2025-05-07T20:31:45.3978767Z scale_ub: Optional[float], 2025-05-07T20:31:45.3979109Z contiguous: bool, 2025-05-07T20:31:45.3979491Z compiled: bool, 2025-05-07T20:31:45.3979802Z ) -> None: 2025-05-07T20:31:45.3980095Z torch.manual_seed(2025) 2025-05-07T20:31:45.3980472Z 2025-05-07T20:31:45.3980831Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3981262Z 2025-05-07T20:31:45.3981573Z x_sign = torch.sign(x) 2025-05-07T20:31:45.3981949Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.3982391Z x = x_sign * x_clamp 2025-05-07T20:31:45.3982782Z x0 = x[:, :D] 2025-05-07T20:31:45.3983136Z x1 = x[:, D:] 2025-05-07T20:31:45.3983440Z 2025-05-07T20:31:45.3983782Z if contiguous: 2025-05-07T20:31:45.3984090Z x0 = x0.contiguous() 2025-05-07T20:31:45.3984419Z x1 = x1.contiguous() 2025-05-07T20:31:45.3984812Z 2025-05-07T20:31:45.3985088Z if scale_ub is not None: 2025-05-07T20:31:45.3985430Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.3985926Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.3986305Z ) 2025-05-07T20:31:45.3986607Z else: 2025-05-07T20:31:45.3986972Z scale_ub_tensor = None 2025-05-07T20:31:45.3987327Z 2025-05-07T20:31:45.3987595Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3988085Z op = silu_mul_quant 2025-05-07T20:31:45.3988421Z if compiled: 2025-05-07T20:31:45.3988708Z op = torch.compile(op) 2025-05-07T20:31:45.3989251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3989624Z 2025-05-07T20:31:45.3989862Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.3990183Z 2025-05-07T20:31:45.3990312Z moe/activation_test.py:117: 2025-05-07T20:31:45.3990725Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3991185Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.3991554Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3992327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.3993140Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.3993859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.3994764Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.3995751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.3996488Z kernel = self.compile( 2025-05-07T20:31:45.3997176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.3998102Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.3998599Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3998882Z 2025-05-07T20:31:45.3999150Z self = 2025-05-07T20:31:45.4000335Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4001802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a8d42160>} 2025-05-07T20:31:45.4003254Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4004505Z context = 2025-05-07T20:31:45.4004825Z 2025-05-07T20:31:45.4005067Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4005659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4006270Z module_map=module_map) 2025-05-07T20:31:45.4006747Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4007175Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4007552Z E ^ 2025-05-07T20:31:45.4008153Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4008707Z 2025-05-07T20:31:45.4009175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4009771Z 2025-05-07T20:31:45.4009984Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4010447Z self=, 2025-05-07T20:31:45.4010917Z T=128, 2025-05-07T20:31:45.4011285Z D=7168, 2025-05-07T20:31:45.4011536Z scale_ub=None, 2025-05-07T20:31:45.4011825Z contiguous=False, 2025-05-07T20:31:45.4012226Z compiled=True, 2025-05-07T20:31:45.4012486Z ) 2025-05-07T20:31:45.4012885Z self = 2025-05-07T20:31:45.4013938Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4014231Z 2025-05-07T20:31:45.4014371Z @given( 2025-05-07T20:31:45.4014645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4015151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4015590Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4015963Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4016472Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4016841Z ) 2025-05-07T20:31:45.4017226Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4017838Z def test_silu_mul_quant( 2025-05-07T20:31:45.4018164Z self, 2025-05-07T20:31:45.4018504Z T: int, 2025-05-07T20:31:45.4018800Z D: int, 2025-05-07T20:31:45.4019103Z scale_ub: Optional[float], 2025-05-07T20:31:45.4019513Z contiguous: bool, 2025-05-07T20:31:45.4019820Z compiled: bool, 2025-05-07T20:31:45.4020128Z ) -> None: 2025-05-07T20:31:45.4020485Z torch.manual_seed(2025) 2025-05-07T20:31:45.4020790Z 2025-05-07T20:31:45.4021169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4021643Z 2025-05-07T20:31:45.4021902Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4022301Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4022733Z x = x_sign * x_clamp 2025-05-07T20:31:45.4023098Z x0 = x[:, :D] 2025-05-07T20:31:45.4023388Z x1 = x[:, D:] 2025-05-07T20:31:45.4023719Z 2025-05-07T20:31:45.4024005Z if contiguous: 2025-05-07T20:31:45.4024308Z x0 = x0.contiguous() 2025-05-07T20:31:45.4024696Z x1 = x1.contiguous() 2025-05-07T20:31:45.4025052Z 2025-05-07T20:31:45.4025295Z if scale_ub is not None: 2025-05-07T20:31:45.4025704Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4026160Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4026517Z ) 2025-05-07T20:31:45.4026851Z else: 2025-05-07T20:31:45.4027185Z scale_ub_tensor = None 2025-05-07T20:31:45.4027518Z 2025-05-07T20:31:45.4035401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4035776Z op = silu_mul_quant 2025-05-07T20:31:45.4036043Z if compiled: 2025-05-07T20:31:45.4036294Z op = torch.compile(op) 2025-05-07T20:31:45.4036846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4037131Z 2025-05-07T20:31:45.4037325Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4037617Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4037913Z 2025-05-07T20:31:45.4038151Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4038488Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4038785Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4039107Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4039461Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4039772Z 2025-05-07T20:31:45.4040136Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:45.4040332Z 2025-05-07T20:31:45.4040435Z moe/activation_test.py:126: 2025-05-07T20:31:45.4040734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4041084Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4041409Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4042205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4042966Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4043521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4044200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4044889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4045621Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4046385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4047127Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4047865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4048507Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4049113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4049631Z fn() 2025-05-07T20:31:45.4050148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4050740Z self.fn.run( 2025-05-07T20:31:45.4051212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4051751Z kernel = self.compile( 2025-05-07T20:31:45.4052304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4052968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4053369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4053610Z 2025-05-07T20:31:45.4053818Z self = 2025-05-07T20:31:45.4054914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4056313Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f68a8d437e0>} 2025-05-07T20:31:45.4057759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4058798Z context = 2025-05-07T20:31:45.4059094Z 2025-05-07T20:31:45.4059262Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4059791Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4060256Z module_map=module_map) 2025-05-07T20:31:45.4060628Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4060993Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4061255Z E ^ 2025-05-07T20:31:45.4061802Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4062265Z 2025-05-07T20:31:45.4062693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4063206Z 2025-05-07T20:31:45.4063320Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4063732Z self=, 2025-05-07T20:31:45.4064139Z T=128, 2025-05-07T20:31:45.4064334Z D=7168, 2025-05-07T20:31:45.4064524Z scale_ub=None, 2025-05-07T20:31:45.4064747Z contiguous=False, 2025-05-07T20:31:45.4064979Z compiled=False, 2025-05-07T20:31:45.4065187Z ) 2025-05-07T20:31:45.4065561Z self = 2025-05-07T20:31:45.4066059Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.4066328Z 2025-05-07T20:31:45.4066420Z @given( 2025-05-07T20:31:45.4066653Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4066973Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4067281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4067614Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4067946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4068235Z ) 2025-05-07T20:31:45.4068580Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4069026Z def test_silu_mul_quant( 2025-05-07T20:31:45.4069349Z self, 2025-05-07T20:31:45.4069550Z T: int, 2025-05-07T20:31:45.4069741Z D: int, 2025-05-07T20:31:45.4069967Z scale_ub: Optional[float], 2025-05-07T20:31:45.4070240Z contiguous: bool, 2025-05-07T20:31:45.4070476Z compiled: bool, 2025-05-07T20:31:45.4070703Z ) -> None: 2025-05-07T20:31:45.4070924Z torch.manual_seed(2025) 2025-05-07T20:31:45.4071166Z 2025-05-07T20:31:45.4071445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4071794Z 2025-05-07T20:31:45.4071985Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4072284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4072597Z x = x_sign * x_clamp 2025-05-07T20:31:45.4072834Z x0 = x[:, :D] 2025-05-07T20:31:45.4073054Z x1 = x[:, D:] 2025-05-07T20:31:45.4073270Z 2025-05-07T20:31:45.4073454Z if contiguous: 2025-05-07T20:31:45.4073689Z x0 = x0.contiguous() 2025-05-07T20:31:45.4073953Z x1 = x1.contiguous() 2025-05-07T20:31:45.4074184Z 2025-05-07T20:31:45.4074380Z if scale_ub is not None: 2025-05-07T20:31:45.4074654Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4074991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4075070Z ) 2025-05-07T20:31:45.4075151Z else: 2025-05-07T20:31:45.4075256Z scale_ub_tensor = None 2025-05-07T20:31:45.4075328Z 2025-05-07T20:31:45.4075458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4075556Z op = silu_mul_quant 2025-05-07T20:31:45.4075730Z if compiled: 
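Every failure in this job has the same root cause. Triton's fp8e4nv type is the E4M3 float8 format (torch.float8_e4m3fn), which Triton only lowers on NVIDIA GPUs with compute capability 8.9 or newer (Ada, Hopper). This job runs on a g5.4xlarge runner, whose A10G GPU is compute capability 8.6, so any kernel that quantizes to E4M3 fails at compile time and Triton offers only fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal capability guard might look like the sketch below (the helper name is illustrative, not from the FBGEMM sources):

    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton lowers fp8e4nv (E4M3) only on sm_89+ NVIDIA GPUs.
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)
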
2025-05-07T20:31:45.4075838Z op = torch.compile(op) 2025-05-07T20:31:45.4075952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4076026Z 2025-05-07T20:31:45.4076117Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4076122Z 2025-05-07T20:31:45.4076228Z moe/activation_test.py:117: 2025-05-07T20:31:45.4076357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4076467Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4076569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4077072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4077252Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4077611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4077843Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4078191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4078288Z kernel = self.compile( 2025-05-07T20:31:45.4078677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4078850Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4078978Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4078983Z 2025-05-07T20:31:45.4079198Z self = 2025-05-07T20:31:45.4079990Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4080509Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6894be0860>} 2025-05-07T20:31:45.4081265Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4081462Z context = 2025-05-07T20:31:45.4081466Z 2025-05-07T20:31:45.4081635Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4081895Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4082021Z module_map=module_map) 2025-05-07T20:31:45.4082183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4082282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4082373Z E ^ 2025-05-07T20:31:45.4082733Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4082738Z 2025-05-07T20:31:45.4083162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4083166Z 2025-05-07T20:31:45.4083271Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4083495Z self=, 2025-05-07T20:31:45.4083583Z T=4096, 2025-05-07T20:31:45.4083659Z D=5120, 2025-05-07T20:31:45.4083742Z scale_ub=1200.0, 2025-05-07T20:31:45.4083835Z contiguous=True, 2025-05-07T20:31:45.4083925Z compiled=False, 2025-05-07T20:31:45.4083998Z ) 2025-05-07T20:31:45.4084225Z self = 2025-05-07T20:31:45.4084477Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.4084482Z 2025-05-07T20:31:45.4084570Z @given( 2025-05-07T20:31:45.4084685Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4084788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4084908Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4085024Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4085136Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4085218Z ) 2025-05-07T20:31:45.4085462Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4085555Z def test_silu_mul_quant( 2025-05-07T20:31:45.4085637Z self, 2025-05-07T20:31:45.4085787Z T: int, 2025-05-07T20:31:45.4085863Z D: int, 2025-05-07T20:31:45.4085966Z scale_ub: Optional[float], 2025-05-07T20:31:45.4086055Z contiguous: bool, 2025-05-07T20:31:45.4086147Z compiled: bool, 2025-05-07T20:31:45.4086230Z ) -> None: 2025-05-07T20:31:45.4086324Z torch.manual_seed(2025) 2025-05-07T20:31:45.4086407Z 2025-05-07T20:31:45.4086572Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4086646Z 2025-05-07T20:31:45.4086744Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4086870Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4086961Z x = x_sign * x_clamp 2025-05-07T20:31:45.4087051Z x0 = x[:, :D] 2025-05-07T20:31:45.4087131Z x1 = x[:, D:] 2025-05-07T20:31:45.4087202Z 2025-05-07T20:31:45.4087291Z if contiguous: 2025-05-07T20:31:45.4087382Z x0 = x0.contiguous() 2025-05-07T20:31:45.4087477Z x1 = x1.contiguous() 2025-05-07T20:31:45.4087557Z 2025-05-07T20:31:45.4087645Z if scale_ub is not None: 2025-05-07T20:31:45.4087756Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4087892Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4087971Z ) 2025-05-07T20:31:45.4088053Z else: 2025-05-07T20:31:45.4088146Z scale_ub_tensor = None 2025-05-07T20:31:45.4088217Z 2025-05-07T20:31:45.4088351Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4088441Z op = silu_mul_quant 2025-05-07T20:31:45.4088524Z if compiled: 2025-05-07T20:31:45.4088628Z op = torch.compile(op) 2025-05-07T20:31:45.4088731Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4088803Z 2025-05-07T20:31:45.4088897Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4088901Z 2025-05-07T20:31:45.4088997Z moe/activation_test.py:117: 2025-05-07T20:31:45.4089133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4089237Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4089336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4089848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4089944Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4090301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4090528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4090867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4090969Z kernel = self.compile( 2025-05-07T20:31:45.4091350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4091526Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4091662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4091667Z 2025-05-07T20:31:45.4091950Z self = 2025-05-07T20:31:45.4092740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4093245Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6894ef2b60>} 2025-05-07T20:31:45.4094007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4094292Z context = 2025-05-07T20:31:45.4094296Z 2025-05-07T20:31:45.4094459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4094729Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4094835Z module_map=module_map) 2025-05-07T20:31:45.4094996Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4095102Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4095199Z E ^ 2025-05-07T20:31:45.4095588Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4095593Z 2025-05-07T20:31:45.4096011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4096021Z 2025-05-07T20:31:45.4096122Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4096352Z self=, 2025-05-07T20:31:45.4096428Z T=1, 2025-05-07T20:31:45.4096508Z D=5120, 2025-05-07T20:31:45.4096597Z scale_ub=None, 2025-05-07T20:31:45.4096686Z contiguous=True, 2025-05-07T20:31:45.4096775Z compiled=True, 2025-05-07T20:31:45.4096847Z ) 2025-05-07T20:31:45.4097064Z self = 2025-05-07T20:31:45.4097229Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.4097233Z 2025-05-07T20:31:45.4097311Z @given( 2025-05-07T20:31:45.4097428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4097532Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4097646Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4097761Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4097884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4097958Z ) 2025-05-07T20:31:45.4098214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4098310Z def test_silu_mul_quant( 2025-05-07T20:31:45.4098389Z self, 2025-05-07T20:31:45.4098472Z T: int, 2025-05-07T20:31:45.4098550Z D: int, 2025-05-07T20:31:45.4098648Z scale_ub: Optional[float], 2025-05-07T20:31:45.4098745Z contiguous: bool, 2025-05-07T20:31:45.4098830Z compiled: bool, 2025-05-07T20:31:45.4098907Z ) -> None: 2025-05-07T20:31:45.4099008Z torch.manual_seed(2025) 2025-05-07T20:31:45.4099080Z 2025-05-07T20:31:45.4099252Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4099336Z 2025-05-07T20:31:45.4099428Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4099561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4099654Z x = x_sign * x_clamp 2025-05-07T20:31:45.4099736Z x0 = x[:, :D] 2025-05-07T20:31:45.4099825Z x1 = x[:, D:] 2025-05-07T20:31:45.4099896Z 2025-05-07T20:31:45.4099980Z if contiguous: 2025-05-07T20:31:45.4100164Z x0 = x0.contiguous() 2025-05-07T20:31:45.4100254Z x1 = x1.contiguous() 2025-05-07T20:31:45.4100325Z 2025-05-07T20:31:45.4100422Z if scale_ub is not None: 2025-05-07T20:31:45.4100528Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4100660Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4100743Z ) 2025-05-07T20:31:45.4100820Z else: 2025-05-07T20:31:45.4100920Z scale_ub_tensor = None 2025-05-07T20:31:45.4100997Z 2025-05-07T20:31:45.4101124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4101219Z op = silu_mul_quant 2025-05-07T20:31:45.4101304Z if compiled: 2025-05-07T20:31:45.4101479Z op = torch.compile(op) 2025-05-07T20:31:45.4101591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4101663Z 2025-05-07T20:31:45.4101752Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4101882Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4101953Z 2025-05-07T20:31:45.4102086Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4102194Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4102292Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4102418Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4102556Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4102627Z 2025-05-07T20:31:45.4102731Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:45.4102736Z 2025-05-07T20:31:45.4102833Z moe/activation_test.py:126: 2025-05-07T20:31:45.4102961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4103078Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4103208Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4103789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4103890Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4104257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4104481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4104850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4105112Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4105512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4105775Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4106154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4106322Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4106670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4106748Z fn() 2025-05-07T20:31:45.4107150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4107238Z self.fn.run( 2025-05-07T20:31:45.4107577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4107675Z kernel = self.compile( 2025-05-07T20:31:45.4108060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4108232Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4108448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4108453Z 2025-05-07T20:31:45.4108656Z self = 2025-05-07T20:31:45.4109524Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4110032Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f68a93ec040>} 2025-05-07T20:31:45.4110779Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4111051Z context = 2025-05-07T20:31:45.4111056Z 2025-05-07T20:31:45.4111221Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4111489Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4111596Z module_map=module_map) 2025-05-07T20:31:45.4111757Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4111866Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4111944Z E ^ 2025-05-07T20:31:45.4112301Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4112312Z 2025-05-07T20:31:45.4112730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4112740Z 2025-05-07T20:31:45.4112843Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4113076Z self=, 2025-05-07T20:31:45.4113153Z T=2048, 2025-05-07T20:31:45.4113229Z D=5120, 2025-05-07T20:31:45.4113317Z scale_ub=None, 2025-05-07T20:31:45.4113401Z contiguous=True, 2025-05-07T20:31:45.4113484Z compiled=True, 2025-05-07T20:31:45.4113560Z ) 2025-05-07T20:31:45.4113780Z self = 2025-05-07T20:31:45.4113955Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.4113959Z 2025-05-07T20:31:45.4114035Z @given( 2025-05-07T20:31:45.4114152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4114256Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4114370Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4114490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4114608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4114681Z ) 2025-05-07T20:31:45.4114929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4115032Z def test_silu_mul_quant( 2025-05-07T20:31:45.4115107Z self, 2025-05-07T20:31:45.4115189Z T: int, 2025-05-07T20:31:45.4115264Z D: int, 2025-05-07T20:31:45.4115361Z scale_ub: Optional[float], 2025-05-07T20:31:45.4115458Z contiguous: bool, 2025-05-07T20:31:45.4115545Z compiled: bool, 2025-05-07T20:31:45.4115623Z ) -> None: 2025-05-07T20:31:45.4115722Z torch.manual_seed(2025) 2025-05-07T20:31:45.4115796Z 2025-05-07T20:31:45.4115963Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4116042Z 2025-05-07T20:31:45.4116133Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4116262Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4116356Z x = x_sign * x_clamp 2025-05-07T20:31:45.4116435Z x0 = x[:, :D] 2025-05-07T20:31:45.4116521Z x1 = x[:, D:] 2025-05-07T20:31:45.4116677Z 2025-05-07T20:31:45.4116761Z if contiguous: 2025-05-07T20:31:45.4116858Z x0 = x0.contiguous() 2025-05-07T20:31:45.4116945Z x1 = x1.contiguous() 2025-05-07T20:31:45.4117016Z 2025-05-07T20:31:45.4117110Z if scale_ub is not None: 2025-05-07T20:31:45.4117214Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4117348Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4117429Z ) 2025-05-07T20:31:45.4117504Z else: 2025-05-07T20:31:45.4117600Z scale_ub_tensor = None 2025-05-07T20:31:45.4117677Z 2025-05-07T20:31:45.4117804Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4117976Z op = silu_mul_quant 2025-05-07T20:31:45.4118061Z if compiled: 
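To keep this suite green on pre-sm_89 runners, one option is to skip the whole test class up front rather than letting Hypothesis grind through every example. A sketch with unittest, assuming the supports_fp8_e4m3 helper above (the class name here is an assumption; the object reprs in this log were stripped, so the real name is not visible):

    import unittest
    import torch

    @unittest.skipIf(
        not torch.cuda.is_available()
        or torch.cuda.get_device_capability() < (8, 9),
        "fp8e4nv (E4M3) needs NVIDIA compute capability >= 8.9",
    )
    class SiluMulQuantTests(unittest.TestCase):
        ...
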
2025-05-07T20:31:45.4118159Z op = torch.compile(op) 2025-05-07T20:31:45.4118267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4118338Z 2025-05-07T20:31:45.4118432Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4118561Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4118632Z 2025-05-07T20:31:45.4118771Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4118878Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4118977Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4119096Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4119242Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4119314Z 2025-05-07T20:31:45.4119419Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.4119424Z 2025-05-07T20:31:45.4119530Z moe/activation_test.py:126: 2025-05-07T20:31:45.4119658Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4119768Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4119904Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4120468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4120576Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4120940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4121168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4121535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4121793Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4122207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4122463Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4122843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4123010Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4123353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4123436Z fn() 2025-05-07T20:31:45.4123843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4123925Z self.fn.run( 2025-05-07T20:31:45.4124270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4124366Z kernel = self.compile( 2025-05-07T20:31:45.4124755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4125032Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4125166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4125170Z 2025-05-07T20:31:45.4125379Z self = 2025-05-07T20:31:45.4126163Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:31:45.4126679Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a9986840>} 2025-05-07T20:31:45.4127505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4127699Z context = 2025-05-07T20:31:45.4127709Z 2025-05-07T20:31:45.4127871Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4128131Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4128539Z module_map=module_map) 2025-05-07T20:31:45.4128734Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4128839Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4128927Z E ^ 2025-05-07T20:31:45.4129284Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4129295Z 2025-05-07T20:31:45.4129721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4129726Z 2025-05-07T20:31:45.4129837Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4130060Z self=, 2025-05-07T20:31:45.4130145Z T=128, 2025-05-07T20:31:45.4130223Z D=5120, 2025-05-07T20:31:45.4130306Z scale_ub=None, 2025-05-07T20:31:45.4130398Z contiguous=True, 2025-05-07T20:31:45.4130481Z compiled=True, 2025-05-07T20:31:45.4130553Z ) 2025-05-07T20:31:45.4130776Z self = 2025-05-07T20:31:45.4130941Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.4130946Z 2025-05-07T20:31:45.4131032Z @given( 2025-05-07T20:31:45.4131152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4131257Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4131378Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4131494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4131612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4131693Z ) 2025-05-07T20:31:45.4131941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4132042Z def test_silu_mul_quant( 2025-05-07T20:31:45.4132119Z self, 2025-05-07T20:31:45.4132196Z T: int, 2025-05-07T20:31:45.4132278Z D: int, 2025-05-07T20:31:45.4132376Z scale_ub: Optional[float], 2025-05-07T20:31:45.4132466Z contiguous: bool, 2025-05-07T20:31:45.4132557Z compiled: bool, 2025-05-07T20:31:45.4132636Z ) -> None: 2025-05-07T20:31:45.4132731Z torch.manual_seed(2025) 2025-05-07T20:31:45.4132810Z 2025-05-07T20:31:45.4132978Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4133056Z 2025-05-07T20:31:45.4133155Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4133278Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4133366Z x = x_sign * x_clamp 2025-05-07T20:31:45.4133643Z x0 = x[:, :D] 2025-05-07T20:31:45.4133729Z x1 = x[:, D:] 2025-05-07T20:31:45.4133806Z 2025-05-07T20:31:45.4133889Z if contiguous: 2025-05-07T20:31:45.4133980Z x0 = x0.contiguous() 2025-05-07T20:31:45.4134074Z x1 = x1.contiguous() 2025-05-07T20:31:45.4134145Z 2025-05-07T20:31:45.4134234Z if scale_ub is not None: 2025-05-07T20:31:45.4134346Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4134480Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4134555Z ) 2025-05-07T20:31:45.4134636Z else: 2025-05-07T20:31:45.4134729Z scale_ub_tensor = None 2025-05-07T20:31:45.4134918Z 2025-05-07T20:31:45.4135055Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:31:45.4135144Z op = silu_mul_quant 2025-05-07T20:31:45.4135235Z if compiled: 2025-05-07T20:31:45.4135335Z op = torch.compile(op) 2025-05-07T20:31:45.4135445Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4135522Z 2025-05-07T20:31:45.4135611Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4135732Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4135813Z 2025-05-07T20:31:45.4135948Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4136050Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4136155Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4136277Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4136416Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4136496Z 2025-05-07T20:31:45.4136601Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.4136606Z 2025-05-07T20:31:45.4136712Z moe/activation_test.py:126: 2025-05-07T20:31:45.4136843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4136952Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4137094Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4137658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4137763Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4138131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4138353Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4138727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4138988Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4139395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4139657Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4140034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4140209Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4140553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4140634Z fn() 2025-05-07T20:31:45.4141047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4141135Z self.fn.run( 2025-05-07T20:31:45.4141478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4141577Z kernel = self.compile( 2025-05-07T20:31:45.4142042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4142229Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4142358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4142363Z 2025-05-07T20:31:45.4142565Z self = 2025-05-07T20:31:45.4143355Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4143863Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6894c95c60>} 2025-05-07T20:31:45.4144702Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4144892Z context = 2025-05-07T20:31:45.4144897Z 2025-05-07T20:31:45.4145067Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4145330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4145436Z module_map=module_map) 2025-05-07T20:31:45.4145602Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4145703Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4145783Z E ^ 2025-05-07T20:31:45.4146153Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4146158Z 2025-05-07T20:31:45.4146581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4146585Z 2025-05-07T20:31:45.4146698Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4146922Z self=, 2025-05-07T20:31:45.4146998Z T=4096, 2025-05-07T20:31:45.4147084Z D=5120, 2025-05-07T20:31:45.4147169Z scale_ub=None, 2025-05-07T20:31:45.4147257Z contiguous=True, 2025-05-07T20:31:45.4147346Z compiled=True, 2025-05-07T20:31:45.4147420Z ) 2025-05-07T20:31:45.4147637Z self = 2025-05-07T20:31:45.4147813Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.4147818Z 2025-05-07T20:31:45.4147901Z @given( 2025-05-07T20:31:45.4148024Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4148123Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4148238Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4148365Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4148477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4148556Z ) 2025-05-07T20:31:45.4148806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4148902Z def test_silu_mul_quant( 2025-05-07T20:31:45.4148982Z self, 2025-05-07T20:31:45.4149138Z T: int, 2025-05-07T20:31:45.4149216Z D: int, 2025-05-07T20:31:45.4149321Z scale_ub: Optional[float], 2025-05-07T20:31:45.4149410Z contiguous: bool, 2025-05-07T20:31:45.4149497Z compiled: bool, 2025-05-07T20:31:45.4149580Z ) -> None: 2025-05-07T20:31:45.4149676Z torch.manual_seed(2025) 2025-05-07T20:31:45.4149753Z 2025-05-07T20:31:45.4149928Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4150001Z 2025-05-07T20:31:45.4150094Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4150307Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4150395Z x = x_sign * x_clamp 2025-05-07T20:31:45.4150475Z x0 = x[:, :D] 2025-05-07T20:31:45.4150561Z x1 = x[:, D:] 2025-05-07T20:31:45.4150632Z 2025-05-07T20:31:45.4150722Z if contiguous: 2025-05-07T20:31:45.4150812Z x0 = x0.contiguous() 2025-05-07T20:31:45.4150900Z x1 = x1.contiguous() 2025-05-07T20:31:45.4150978Z 2025-05-07T20:31:45.4151070Z if scale_ub is not None: 2025-05-07T20:31:45.4151175Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4151317Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4151392Z ) 2025-05-07T20:31:45.4151614Z else: 2025-05-07T20:31:45.4151715Z scale_ub_tensor 
= None 2025-05-07T20:31:45.4151787Z 2025-05-07T20:31:45.4151918Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4152014Z op = silu_mul_quant 2025-05-07T20:31:45.4152106Z if compiled: 2025-05-07T20:31:45.4152206Z op = torch.compile(op) 2025-05-07T20:31:45.4152320Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4152393Z 2025-05-07T20:31:45.4152490Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4152610Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4152684Z 2025-05-07T20:31:45.4152827Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4152929Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4153028Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4153155Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4153305Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4153379Z 2025-05-07T20:31:45.4153488Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.4153492Z 2025-05-07T20:31:45.4153592Z moe/activation_test.py:126: 2025-05-07T20:31:45.4153731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4153836Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4153971Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4154545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4154649Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4155020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4155293Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4155667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4155929Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4156336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4156587Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4156970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4157135Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4157489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4157566Z fn() 2025-05-07T20:31:45.4157967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4158062Z self.fn.run( 2025-05-07T20:31:45.4158400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4158601Z kernel = self.compile( 2025-05-07T20:31:45.4158997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4159171Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4159304Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4159308Z 2025-05-07T20:31:45.4159509Z self = 2025-05-07T20:31:45.4160294Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4160888Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68940dfc40>} 2025-05-07T20:31:45.4161648Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4161842Z context = 2025-05-07T20:31:45.4161847Z 2025-05-07T20:31:45.4162009Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4162277Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4162383Z module_map=module_map) 2025-05-07T20:31:45.4162544Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4162656Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4162734Z E ^ 2025-05-07T20:31:45.4163091Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4163096Z 2025-05-07T20:31:45.4163522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4163526Z 2025-05-07T20:31:45.4163628Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4163857Z self=, 2025-05-07T20:31:45.4163933Z T=16384, 2025-05-07T20:31:45.4164009Z D=5120, 2025-05-07T20:31:45.4164096Z scale_ub=None, 2025-05-07T20:31:45.4164180Z contiguous=True, 2025-05-07T20:31:45.4164263Z compiled=True, 2025-05-07T20:31:45.4164341Z ) 2025-05-07T20:31:45.4164561Z self = 2025-05-07T20:31:45.4164739Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.4164749Z 2025-05-07T20:31:45.4164833Z @given( 2025-05-07T20:31:45.4164950Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4165056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4165173Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4165288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4165407Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4165482Z ) 2025-05-07T20:31:45.4165727Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4165827Z def test_silu_mul_quant( 2025-05-07T20:31:45.4165904Z self, 2025-05-07T20:31:45.4165980Z T: int, 2025-05-07T20:31:45.4166065Z D: int, 2025-05-07T20:31:45.4166162Z scale_ub: Optional[float], 2025-05-07T20:31:45.4166255Z contiguous: bool, 2025-05-07T20:31:45.4166344Z compiled: bool, 2025-05-07T20:31:45.4166421Z ) -> None: 2025-05-07T20:31:45.4166522Z torch.manual_seed(2025) 2025-05-07T20:31:45.4166594Z 2025-05-07T20:31:45.4166761Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4166921Z 2025-05-07T20:31:45.4167013Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4167136Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4167231Z x = x_sign * x_clamp 2025-05-07T20:31:45.4167311Z x0 = x[:, :D] 2025-05-07T20:31:45.4167393Z x1 = x[:, D:] 2025-05-07T20:31:45.4167471Z 2025-05-07T20:31:45.4167556Z if contiguous: 2025-05-07T20:31:45.4167653Z x0 = x0.contiguous() 2025-05-07T20:31:45.4167741Z x1 = x1.contiguous() 2025-05-07T20:31:45.4167813Z 2025-05-07T20:31:45.4167912Z if scale_ub is not None: 2025-05-07T20:31:45.4168017Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4168152Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:31:45.4168422Z ) 2025-05-07T20:31:45.4168536Z else: 2025-05-07T20:31:45.4175655Z scale_ub_tensor = None 2025-05-07T20:31:45.4175744Z 2025-05-07T20:31:45.4175899Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4175994Z op = silu_mul_quant 2025-05-07T20:31:45.4176090Z if compiled: 2025-05-07T20:31:45.4176197Z op = torch.compile(op) 2025-05-07T20:31:45.4176305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4176386Z 2025-05-07T20:31:45.4176478Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4176603Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4176685Z 2025-05-07T20:31:45.4176824Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4176927Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4177035Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4177163Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4177313Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4177386Z 2025-05-07T20:31:45.4177494Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.4177499Z 2025-05-07T20:31:45.4177607Z moe/activation_test.py:126: 2025-05-07T20:31:45.4177738Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4177849Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4177990Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4178559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4178672Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4179036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4179265Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4179644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4179902Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4180308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4180558Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4180935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4181110Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4181452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4181535Z fn() 2025-05-07T20:31:45.4181942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4182027Z self.fn.run( 2025-05-07T20:31:45.4182533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4182676Z kernel = self.compile( 2025-05-07T20:31:45.4183231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4183491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4183674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
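For reference, ref_fn above is plain eager PyTorch until its last line: it evaluates x0 * sigmoid(x0) * x1 in fp32 and only dies inside triton_quantize_fp8_row, whose kernel must emit fp8e4nv. The row-wise recipe that kernel implements can be written in pure PyTorch roughly as follows (a sketch of the standard rowwise scheme, with scale_ub applied as a clamp on the per-row max; not FBGEMM's exact kernel):

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max |value| maps to the E4M3 max.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=-1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantization then matches the test's own check: y_fp8.to(torch.float32) * scale[:, None].
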
2025-05-07T20:31:45.4183681Z 2025-05-07T20:31:45.4183977Z self = 2025-05-07T20:31:45.4185010Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4185704Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6895e271a0>} 2025-05-07T20:31:45.4186468Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4186658Z context = 2025-05-07T20:31:45.4186663Z 2025-05-07T20:31:45.4186838Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4187101Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4187209Z module_map=module_map) 2025-05-07T20:31:45.4187385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4187488Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4187566Z E ^ 2025-05-07T20:31:45.4187939Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4187944Z 2025-05-07T20:31:45.4188362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4188367Z 2025-05-07T20:31:45.4188477Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4188702Z self=, 2025-05-07T20:31:45.4188778Z T=1, 2025-05-07T20:31:45.4188862Z D=5120, 2025-05-07T20:31:45.4188946Z scale_ub=1200.0, 2025-05-07T20:31:45.4189038Z contiguous=True, 2025-05-07T20:31:45.4189246Z compiled=True, 2025-05-07T20:31:45.4189324Z ) 2025-05-07T20:31:45.4189558Z self = 2025-05-07T20:31:45.4189723Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.4189728Z 2025-05-07T20:31:45.4189809Z @given( 2025-05-07T20:31:45.4189939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4190039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4190156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4190279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4190395Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4190476Z ) 2025-05-07T20:31:45.4190722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4190819Z def test_silu_mul_quant( 2025-05-07T20:31:45.4190902Z self, 2025-05-07T20:31:45.4190980Z T: int, 2025-05-07T20:31:45.4191058Z D: int, 2025-05-07T20:31:45.4191165Z scale_ub: Optional[float], 2025-05-07T20:31:45.4191257Z contiguous: bool, 2025-05-07T20:31:45.4191343Z compiled: bool, 2025-05-07T20:31:45.4191431Z ) -> None: 2025-05-07T20:31:45.4191526Z torch.manual_seed(2025) 2025-05-07T20:31:45.4191598Z 2025-05-07T20:31:45.4191865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4191939Z 2025-05-07T20:31:45.4192040Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4192169Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4192259Z x = x_sign * x_clamp 2025-05-07T20:31:45.4192350Z x0 = x[:, :D] 2025-05-07T20:31:45.4192429Z x1 = x[:, D:] 2025-05-07T20:31:45.4192502Z 2025-05-07T20:31:45.4192596Z if contiguous: 2025-05-07T20:31:45.4192690Z x0 = x0.contiguous() 2025-05-07T20:31:45.4192779Z x1 = x1.contiguous() 2025-05-07T20:31:45.4192858Z 2025-05-07T20:31:45.4192949Z if scale_ub is not None: 2025-05-07T20:31:45.4193134Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:31:45.4193279Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4193355Z ) 2025-05-07T20:31:45.4193434Z else: 2025-05-07T20:31:45.4193545Z scale_ub_tensor = None 2025-05-07T20:31:45.4193621Z 2025-05-07T20:31:45.4193762Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4193853Z op = silu_mul_quant 2025-05-07T20:31:45.4193939Z if compiled: 2025-05-07T20:31:45.4194050Z op = torch.compile(op) 2025-05-07T20:31:45.4194157Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4194230Z 2025-05-07T20:31:45.4194329Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4194333Z 2025-05-07T20:31:45.4194432Z moe/activation_test.py:117: 2025-05-07T20:31:45.4194563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4194673Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4194781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4195161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4195262Z return fn(*args, **kwargs) 2025-05-07T20:31:45.4195760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4195867Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4196226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4196450Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4196798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4196892Z kernel = self.compile( 2025-05-07T20:31:45.4197281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4197460Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4197594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4197599Z 2025-05-07T20:31:45.4197812Z self = 2025-05-07T20:31:45.4198602Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4199119Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6895f7b420>} 2025-05-07T20:31:45.4199878Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4200081Z context = 2025-05-07T20:31:45.4200085Z 2025-05-07T20:31:45.4200334Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4200600Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4200717Z module_map=module_map) 2025-05-07T20:31:45.4200881Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4200981Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4201071Z E ^ 2025-05-07T20:31:45.4201433Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4201438Z 2025-05-07T20:31:45.4201866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4201950Z 2025-05-07T20:31:45.4202056Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4202283Z self=, 2025-05-07T20:31:45.4202373Z T=1, 2025-05-07T20:31:45.4202449Z D=5120, 2025-05-07T20:31:45.4202534Z scale_ub=None, 2025-05-07T20:31:45.4202628Z contiguous=False, 2025-05-07T20:31:45.4202712Z compiled=True, 2025-05-07T20:31:45.4202784Z ) 2025-05-07T20:31:45.4203019Z self = 2025-05-07T20:31:45.4203183Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4203187Z 2025-05-07T20:31:45.4203276Z @given( 2025-05-07T20:31:45.4203394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4203494Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4203619Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4203742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4203855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4203938Z ) 2025-05-07T20:31:45.4204190Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4204291Z def test_silu_mul_quant( 2025-05-07T20:31:45.4204371Z self, 2025-05-07T20:31:45.4204449Z T: int, 2025-05-07T20:31:45.4204533Z D: int, 2025-05-07T20:31:45.4204631Z scale_ub: Optional[float], 2025-05-07T20:31:45.4204723Z contiguous: bool, 2025-05-07T20:31:45.4204816Z compiled: bool, 2025-05-07T20:31:45.4204894Z ) -> None: 2025-05-07T20:31:45.4204991Z torch.manual_seed(2025) 2025-05-07T20:31:45.4205071Z 2025-05-07T20:31:45.4205242Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4205314Z 2025-05-07T20:31:45.4205413Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4205542Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4205639Z x = x_sign * x_clamp 2025-05-07T20:31:45.4205720Z x0 = x[:, :D] 2025-05-07T20:31:45.4205801Z x1 = x[:, D:] 2025-05-07T20:31:45.4205882Z 2025-05-07T20:31:45.4205970Z if contiguous: 2025-05-07T20:31:45.4206062Z x0 = x0.contiguous() 2025-05-07T20:31:45.4206160Z x1 = x1.contiguous() 2025-05-07T20:31:45.4206233Z 2025-05-07T20:31:45.4206323Z if scale_ub is not None: 2025-05-07T20:31:45.4206436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4206573Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4206649Z ) 2025-05-07T20:31:45.4206733Z else: 2025-05-07T20:31:45.4206827Z scale_ub_tensor = None 2025-05-07T20:31:45.4206900Z 2025-05-07T20:31:45.4207036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4207126Z op = silu_mul_quant 2025-05-07T20:31:45.4207223Z if compiled: 2025-05-07T20:31:45.4207323Z op = torch.compile(op) 2025-05-07T20:31:45.4207430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4207509Z 2025-05-07T20:31:45.4207688Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4207811Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4207892Z 2025-05-07T20:31:45.4208028Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4208133Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4208242Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4208369Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4208518Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4208592Z 2025-05-07T20:31:45.4208695Z > y_fp8_ref, 
Hypothesis then tries the remaining examples. Each one reprints the identical test source and an identical traceback, so only the drawn parameters and the kernel that fails to compile are listed here; every example ends in the same ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"):

Trying example: T=1,   D=5120, scale_ub=None,   contiguous=True,  compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=128, D=5120, scale_ub=None,   contiguous=False, compiled=True  -> _fbgemm_silu_mul_quant
Trying example: T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True  -> _fbgemm_silu_mul_quant
Trying example: T=1,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True  -> _fbgemm_silu_mul_quant
Trying example: T=1,   D=7168, scale_ub=None,   contiguous=False, compiled=True  -> _kernel_quantize_fp8_row (fn() returns; ref_fn() fails, as above)
Trying example: T=1,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True  -> _fbgemm_silu_mul_quant
Trying example: T=1,   D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
at 0x7f6873915b20>} 2025-05-07T20:31:45.4216791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4217084Z context = 2025-05-07T20:31:45.4217089Z 2025-05-07T20:31:45.4217264Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4217529Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4217637Z module_map=module_map) 2025-05-07T20:31:45.4217805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4217911Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4217990Z E ^ 2025-05-07T20:31:45.4218356Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4218436Z 2025-05-07T20:31:45.4218857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4218861Z 2025-05-07T20:31:45.4218974Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4219209Z self=, 2025-05-07T20:31:45.4219297Z T=1, 2025-05-07T20:31:45.4219374Z D=5120, 2025-05-07T20:31:45.4219456Z scale_ub=None, 2025-05-07T20:31:45.4219548Z contiguous=True, 2025-05-07T20:31:45.4219632Z compiled=False, 2025-05-07T20:31:45.4219705Z ) 2025-05-07T20:31:45.4219933Z self = 2025-05-07T20:31:45.4220097Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.4220101Z 2025-05-07T20:31:45.4220186Z @given( 2025-05-07T20:31:45.4220305Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4220408Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4220529Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4220645Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4220759Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4220844Z ) 2025-05-07T20:31:45.4221091Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4221185Z def test_silu_mul_quant( 2025-05-07T20:31:45.4221270Z self, 2025-05-07T20:31:45.4221346Z T: int, 2025-05-07T20:31:45.4221423Z D: int, 2025-05-07T20:31:45.4221531Z scale_ub: Optional[float], 2025-05-07T20:31:45.4221621Z contiguous: bool, 2025-05-07T20:31:45.4221713Z compiled: bool, 2025-05-07T20:31:45.4221792Z ) -> None: 2025-05-07T20:31:45.4221887Z torch.manual_seed(2025) 2025-05-07T20:31:45.4221965Z 2025-05-07T20:31:45.4222134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4222212Z 2025-05-07T20:31:45.4222310Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4222436Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4222530Z x = x_sign * x_clamp 2025-05-07T20:31:45.4222620Z x0 = x[:, :D] 2025-05-07T20:31:45.4222699Z x1 = x[:, D:] 2025-05-07T20:31:45.4222770Z 2025-05-07T20:31:45.4222859Z if contiguous: 2025-05-07T20:31:45.4222950Z x0 = x0.contiguous() 2025-05-07T20:31:45.4223040Z x1 = x1.contiguous() 2025-05-07T20:31:45.4223117Z 2025-05-07T20:31:45.4223205Z if scale_ub is not None: 2025-05-07T20:31:45.4223315Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4223449Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4223523Z ) 2025-05-07T20:31:45.4223605Z else: 2025-05-07T20:31:45.4223698Z scale_ub_tensor = None 2025-05-07T20:31:45.4223770Z 2025-05-07T20:31:45.4223910Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4224001Z op = silu_mul_quant 2025-05-07T20:31:45.4224086Z if compiled: 2025-05-07T20:31:45.4224195Z 
op = torch.compile(op) 2025-05-07T20:31:45.4224386Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4224459Z 2025-05-07T20:31:45.4224556Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4224560Z 2025-05-07T20:31:45.4224657Z moe/activation_test.py:117: 2025-05-07T20:31:45.4224796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4224897Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4224996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4225530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4225638Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4226010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4226312Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4226659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4226759Z kernel = self.compile( 2025-05-07T20:31:45.4227141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4227312Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4227445Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4227450Z 2025-05-07T20:31:45.4227651Z self = 2025-05-07T20:31:45.4228746Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4229311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6873372200>} 2025-05-07T20:31:45.4230070Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4230264Z context = 2025-05-07T20:31:45.4230268Z 2025-05-07T20:31:45.4230431Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4230699Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4230805Z module_map=module_map) 2025-05-07T20:31:45.4230971Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4231078Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4231156Z E ^ 2025-05-07T20:31:45.4231523Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4231528Z 2025-05-07T20:31:45.4231945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4231949Z 2025-05-07T20:31:45.4232053Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4232283Z self=, 2025-05-07T20:31:45.4232360Z T=128, 2025-05-07T20:31:45.4232437Z D=5120, 2025-05-07T20:31:45.4232523Z scale_ub=None, 2025-05-07T20:31:45.4232609Z contiguous=False, 2025-05-07T20:31:45.4232696Z compiled=True, 2025-05-07T20:31:45.4232768Z ) 2025-05-07T20:31:45.4232986Z self = 2025-05-07T20:31:45.4233168Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4233173Z 2025-05-07T20:31:45.4233252Z @given( 2025-05-07T20:31:45.4233591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4233700Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4233815Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4233930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4234050Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4234124Z ) 2025-05-07T20:31:45.4234376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4234470Z def test_silu_mul_quant( 2025-05-07T20:31:45.4234547Z self, 2025-05-07T20:31:45.4234630Z T: int, 2025-05-07T20:31:45.4234707Z D: int, 2025-05-07T20:31:45.4234809Z scale_ub: Optional[float], 2025-05-07T20:31:45.4235025Z contiguous: bool, 2025-05-07T20:31:45.4235112Z compiled: bool, 2025-05-07T20:31:45.4235189Z ) -> None: 2025-05-07T20:31:45.4235290Z torch.manual_seed(2025) 2025-05-07T20:31:45.4235363Z 2025-05-07T20:31:45.4235537Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4235618Z 2025-05-07T20:31:45.4235710Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4235842Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4235929Z x = x_sign * x_clamp 2025-05-07T20:31:45.4236013Z x0 = x[:, :D] 2025-05-07T20:31:45.4236102Z x1 = x[:, D:] 2025-05-07T20:31:45.4236173Z 2025-05-07T20:31:45.4236258Z if contiguous: 2025-05-07T20:31:45.4236356Z x0 = x0.contiguous() 2025-05-07T20:31:45.4236447Z x1 = x1.contiguous() 2025-05-07T20:31:45.4236520Z 2025-05-07T20:31:45.4236620Z if scale_ub is not None: 2025-05-07T20:31:45.4236731Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4236867Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4236950Z ) 2025-05-07T20:31:45.4237028Z else: 2025-05-07T20:31:45.4237132Z scale_ub_tensor = None 2025-05-07T20:31:45.4237207Z 2025-05-07T20:31:45.4237334Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4237429Z op = silu_mul_quant 2025-05-07T20:31:45.4237513Z if compiled: 2025-05-07T20:31:45.4237614Z op = torch.compile(op) 2025-05-07T20:31:45.4237723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4237795Z 2025-05-07T20:31:45.4237885Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4237890Z 2025-05-07T20:31:45.4237992Z moe/activation_test.py:117: 2025-05-07T20:31:45.4238121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4238220Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4238338Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4238715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4238808Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4239309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4239412Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4239768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4239989Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4240334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4240428Z kernel = self.compile( 2025-05-07T20:31:45.4240821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4240998Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4241125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4241212Z 2025-05-07T20:31:45.4241425Z self = 2025-05-07T20:31:45.4242207Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4242717Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68738ee0c0>} 2025-05-07T20:31:45.4243473Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4243801Z context = 2025-05-07T20:31:45.4243812Z 2025-05-07T20:31:45.4243980Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4244242Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4244355Z module_map=module_map) 2025-05-07T20:31:45.4244515Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4244613Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4244697Z E ^ 2025-05-07T20:31:45.4245055Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4245060Z 2025-05-07T20:31:45.4245486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4245497Z 2025-05-07T20:31:45.4245601Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4245824Z self=, 2025-05-07T20:31:45.4245908Z T=128, 2025-05-07T20:31:45.4245993Z D=7168, 2025-05-07T20:31:45.4246079Z scale_ub=1200.0, 2025-05-07T20:31:45.4246173Z contiguous=False, 2025-05-07T20:31:45.4246259Z compiled=False, 2025-05-07T20:31:45.4246332Z ) 2025-05-07T20:31:45.4246558Z self = 2025-05-07T20:31:45.4246729Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.4246734Z 2025-05-07T20:31:45.4246817Z @given( 2025-05-07T20:31:45.4246934Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4247032Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4247149Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4247269Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4247381Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4247459Z ) 2025-05-07T20:31:45.4247707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4247805Z def test_silu_mul_quant( 2025-05-07T20:31:45.4247879Z self, 2025-05-07T20:31:45.4247955Z T: int, 2025-05-07T20:31:45.4248036Z D: int, 2025-05-07T20:31:45.4248133Z scale_ub: Optional[float], 2025-05-07T20:31:45.4248221Z contiguous: bool, 2025-05-07T20:31:45.4248310Z compiled: bool, 2025-05-07T20:31:45.4248386Z ) -> None: 2025-05-07T20:31:45.4248481Z torch.manual_seed(2025) 2025-05-07T20:31:45.4248559Z 2025-05-07T20:31:45.4248724Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4248797Z 2025-05-07T20:31:45.4248898Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4249026Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4249114Z x = x_sign * x_clamp 2025-05-07T20:31:45.4249201Z x0 = x[:, :D] 2025-05-07T20:31:45.4249279Z x1 = x[:, D:] 2025-05-07T20:31:45.4249357Z 2025-05-07T20:31:45.4249546Z if contiguous: 2025-05-07T20:31:45.4249639Z x0 = x0.contiguous() 2025-05-07T20:31:45.4249733Z x1 = x1.contiguous() 2025-05-07T20:31:45.4249805Z 2025-05-07T20:31:45.4249895Z if scale_ub is not None: 2025-05-07T20:31:45.4250007Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4250140Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4250215Z ) 2025-05-07T20:31:45.4250297Z else: 2025-05-07T20:31:45.4250389Z scale_ub_tensor = None 2025-05-07T20:31:45.4250460Z 2025-05-07T20:31:45.4250594Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4250683Z op = silu_mul_quant 2025-05-07T20:31:45.4250859Z if compiled: 2025-05-07T20:31:45.4250960Z op = torch.compile(op) 2025-05-07T20:31:45.4251066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4251143Z 2025-05-07T20:31:45.4251234Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4251245Z 2025-05-07T20:31:45.4251344Z moe/activation_test.py:117: 2025-05-07T20:31:45.4251478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4251579Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4251678Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4252184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4252281Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4252644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4252866Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4253209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4253306Z kernel = self.compile( 2025-05-07T20:31:45.4253691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4253864Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4253996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4254001Z 2025-05-07T20:31:45.4254201Z self = 2025-05-07T20:31:45.4254991Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4255500Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6873917ba0>} 2025-05-07T20:31:45.4256267Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4256455Z context = 2025-05-07T20:31:45.4256460Z 2025-05-07T20:31:45.4256623Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4256892Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4256999Z module_map=module_map) 2025-05-07T20:31:45.4257166Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4257265Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4257345Z E ^ 2025-05-07T20:31:45.4257708Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4257714Z 2025-05-07T20:31:45.4258214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4258219Z 2025-05-07T20:31:45.4258329Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4258554Z self=, 2025-05-07T20:31:45.4258631Z T=128, 2025-05-07T20:31:45.4258719Z D=5120, 2025-05-07T20:31:45.4258801Z scale_ub=None, 2025-05-07T20:31:45.4258888Z contiguous=False, 2025-05-07T20:31:45.4258978Z compiled=False, 2025-05-07T20:31:45.4259049Z ) 2025-05-07T20:31:45.4259268Z self = 2025-05-07T20:31:45.4259447Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.4259527Z 2025-05-07T20:31:45.4259604Z @given( 2025-05-07T20:31:45.4259723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4259829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4259948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4260071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4260184Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4260258Z ) 2025-05-07T20:31:45.4260514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4260611Z def test_silu_mul_quant( 2025-05-07T20:31:45.4260686Z self, 2025-05-07T20:31:45.4260770Z T: int, 2025-05-07T20:31:45.4260847Z D: int, 2025-05-07T20:31:45.4260945Z scale_ub: Optional[float], 2025-05-07T20:31:45.4261041Z contiguous: bool, 2025-05-07T20:31:45.4261126Z compiled: bool, 2025-05-07T20:31:45.4261211Z ) -> None: 2025-05-07T20:31:45.4261312Z torch.manual_seed(2025) 2025-05-07T20:31:45.4261384Z 2025-05-07T20:31:45.4261556Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4261629Z 2025-05-07T20:31:45.4261724Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4261853Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4261943Z x = x_sign * x_clamp 2025-05-07T20:31:45.4262026Z x0 = x[:, :D] 2025-05-07T20:31:45.4262111Z x1 = x[:, D:] 2025-05-07T20:31:45.4262182Z 2025-05-07T20:31:45.4262265Z if contiguous: 2025-05-07T20:31:45.4262362Z x0 = x0.contiguous() 2025-05-07T20:31:45.4262450Z x1 = x1.contiguous() 2025-05-07T20:31:45.4262522Z 2025-05-07T20:31:45.4262618Z if scale_ub is not None: 2025-05-07T20:31:45.4262721Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4262862Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4262941Z ) 2025-05-07T20:31:45.4263015Z else: 2025-05-07T20:31:45.4263114Z scale_ub_tensor = None 2025-05-07T20:31:45.4263187Z 2025-05-07T20:31:45.4263316Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4263417Z op = silu_mul_quant 2025-05-07T20:31:45.4263502Z if compiled: 2025-05-07T20:31:45.4263601Z op = torch.compile(op) 2025-05-07T20:31:45.4263712Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4263785Z 2025-05-07T20:31:45.4263875Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4263886Z 2025-05-07T20:31:45.4263983Z moe/activation_test.py:117: 2025-05-07T20:31:45.4264113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4264219Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4264318Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4264818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4264925Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4265366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4265590Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4265934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4266027Z kernel = self.compile( 2025-05-07T20:31:45.4266415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4266587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4266718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4266722Z 2025-05-07T20:31:45.4267005Z self = 2025-05-07T20:31:45.4267797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4268311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6873914fe0>} 2025-05-07T20:31:45.4269139Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4269334Z context = 2025-05-07T20:31:45.4269339Z 2025-05-07T20:31:45.4269501Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4269772Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4269887Z module_map=module_map) 2025-05-07T20:31:45.4270051Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4270150Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4270234Z E ^ 2025-05-07T20:31:45.4270590Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4270595Z 2025-05-07T20:31:45.4271019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4271023Z 2025-05-07T20:31:45.4271126Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4271349Z self=, 2025-05-07T20:31:45.4271434Z T=128, 2025-05-07T20:31:45.4271511Z D=5120, 2025-05-07T20:31:45.4271600Z scale_ub=1200.0, 2025-05-07T20:31:45.4271691Z contiguous=True, 2025-05-07T20:31:45.4271776Z compiled=False, 2025-05-07T20:31:45.4271853Z ) 2025-05-07T20:31:45.4272074Z self = 2025-05-07T20:31:45.4272253Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.4272258Z 2025-05-07T20:31:45.4272342Z @given( 2025-05-07T20:31:45.4272460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4272562Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4272685Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4272802Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4272917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4272997Z ) 2025-05-07T20:31:45.4273243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4273344Z def test_silu_mul_quant( 2025-05-07T20:31:45.4273425Z self, 2025-05-07T20:31:45.4273502Z T: int, 2025-05-07T20:31:45.4273589Z D: int, 2025-05-07T20:31:45.4273687Z scale_ub: Optional[float], 2025-05-07T20:31:45.4273776Z contiguous: bool, 2025-05-07T20:31:45.4273950Z compiled: bool, 2025-05-07T20:31:45.4274030Z ) -> None: 2025-05-07T20:31:45.4274126Z torch.manual_seed(2025) 2025-05-07T20:31:45.4274209Z 2025-05-07T20:31:45.4274376Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4274449Z 2025-05-07T20:31:45.4274548Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4274671Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4274767Z x = x_sign * x_clamp 2025-05-07T20:31:45.4274847Z x0 = x[:, :D] 2025-05-07T20:31:45.4274927Z x1 = x[:, D:] 2025-05-07T20:31:45.4275004Z 2025-05-07T20:31:45.4275087Z if contiguous: 2025-05-07T20:31:45.4275277Z x0 = x0.contiguous() 2025-05-07T20:31:45.4275373Z x1 = x1.contiguous() 2025-05-07T20:31:45.4275444Z 2025-05-07T20:31:45.4275534Z if scale_ub is not None: 2025-05-07T20:31:45.4275645Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4275784Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4275859Z ) 2025-05-07T20:31:45.4275939Z else: 2025-05-07T20:31:45.4276031Z scale_ub_tensor = None 2025-05-07T20:31:45.4276106Z 2025-05-07T20:31:45.4276242Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4276331Z op = silu_mul_quant 2025-05-07T20:31:45.4276421Z if compiled: 2025-05-07T20:31:45.4276521Z op = torch.compile(op) 2025-05-07T20:31:45.4276625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4276703Z 2025-05-07T20:31:45.4276793Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4276798Z 2025-05-07T20:31:45.4276904Z moe/activation_test.py:117: 2025-05-07T20:31:45.4277039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4277140Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4277238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4277748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4277845Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4278211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4278431Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4278771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4278869Z kernel = self.compile( 2025-05-07T20:31:45.4279251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4279433Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4279565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4279569Z 2025-05-07T20:31:45.4279771Z self = 2025-05-07T20:31:45.4280564Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4281068Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68738837e0>} 2025-05-07T20:31:45.4281828Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4282019Z context = 2025-05-07T20:31:45.4282024Z 2025-05-07T20:31:45.4282328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4282599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4282705Z module_map=module_map) 2025-05-07T20:31:45.4282870Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4282969Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4283047Z E ^ 2025-05-07T20:31:45.4283408Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4283413Z 2025-05-07T20:31:45.4283831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4283915Z 2025-05-07T20:31:45.4284025Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4284249Z self=, 2025-05-07T20:31:45.4284335Z T=1, 2025-05-07T20:31:45.4284420Z D=7168, 2025-05-07T20:31:45.4284503Z scale_ub=1200.0, 2025-05-07T20:31:45.4284588Z contiguous=True, 2025-05-07T20:31:45.4284676Z compiled=True, 2025-05-07T20:31:45.4284750Z ) 2025-05-07T20:31:45.4284971Z self = 2025-05-07T20:31:45.4285142Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.4285146Z 2025-05-07T20:31:45.4285223Z @given( 2025-05-07T20:31:45.4285348Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4285447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4285561Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4285689Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4285802Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4285876Z ) 2025-05-07T20:31:45.4286133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4286227Z def test_silu_mul_quant( 2025-05-07T20:31:45.4286302Z self, 2025-05-07T20:31:45.4286385Z T: int, 2025-05-07T20:31:45.4286464Z D: int, 2025-05-07T20:31:45.4286564Z scale_ub: Optional[float], 2025-05-07T20:31:45.4286660Z contiguous: bool, 2025-05-07T20:31:45.4286747Z compiled: bool, 2025-05-07T20:31:45.4286834Z ) -> None: 2025-05-07T20:31:45.4286929Z torch.manual_seed(2025) 2025-05-07T20:31:45.4287001Z 2025-05-07T20:31:45.4287174Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4287247Z 2025-05-07T20:31:45.4287340Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4287476Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4287565Z x = x_sign * x_clamp 2025-05-07T20:31:45.4287647Z x0 = x[:, :D] 2025-05-07T20:31:45.4287735Z x1 = x[:, D:] 2025-05-07T20:31:45.4287808Z 2025-05-07T20:31:45.4287897Z if contiguous: 2025-05-07T20:31:45.4287995Z x0 = x0.contiguous() 2025-05-07T20:31:45.4288083Z x1 = x1.contiguous() 2025-05-07T20:31:45.4288164Z 2025-05-07T20:31:45.4288255Z if scale_ub is not None: 2025-05-07T20:31:45.4288360Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4288501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4288576Z ) 2025-05-07T20:31:45.4288652Z else: 2025-05-07T20:31:45.4288750Z scale_ub_tensor = None 2025-05-07T20:31:45.4288825Z 2025-05-07T20:31:45.4288953Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4289050Z op = silu_mul_quant 2025-05-07T20:31:45.4289139Z if compiled: 2025-05-07T20:31:45.4289239Z op = torch.compile(op) 2025-05-07T20:31:45.4289351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4289423Z 2025-05-07T20:31:45.4289603Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4289608Z 2025-05-07T20:31:45.4289706Z moe/activation_test.py:117: 2025-05-07T20:31:45.4289836Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4289942Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4290040Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4290408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4290506Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4291003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4291177Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4291533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4291759Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4292107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4292199Z kernel = self.compile( 2025-05-07T20:31:45.4292580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4292756Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4292884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4292888Z 2025-05-07T20:31:45.4293098Z self = 2025-05-07T20:31:45.4293882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4294397Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6872f8e840>} 2025-05-07T20:31:45.4295156Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4295344Z context = 2025-05-07T20:31:45.4295349Z 2025-05-07T20:31:45.4295518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4295779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4295900Z module_map=module_map) 2025-05-07T20:31:45.4296060Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4296159Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4296239Z E ^ 2025-05-07T20:31:45.4296602Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4296606Z 2025-05-07T20:31:45.4297024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4297028Z 2025-05-07T20:31:45.4297138Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4297364Z self=, 2025-05-07T20:31:45.4297446Z T=1, 2025-05-07T20:31:45.4297521Z D=7168, 2025-05-07T20:31:45.4297604Z scale_ub=1200.0, 2025-05-07T20:31:45.4297695Z contiguous=False, 2025-05-07T20:31:45.4297778Z compiled=True, 2025-05-07T20:31:45.4297859Z ) 2025-05-07T20:31:45.4298081Z self = 2025-05-07T20:31:45.4298247Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.4298251Z 2025-05-07T20:31:45.4298414Z @given( 2025-05-07T20:31:45.4298541Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4298639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4298759Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4298875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4298987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4299067Z ) 2025-05-07T20:31:45.4299312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4299405Z def test_silu_mul_quant( 2025-05-07T20:31:45.4299487Z self, 2025-05-07T20:31:45.4299564Z T: int, 2025-05-07T20:31:45.4299714Z D: int, 2025-05-07T20:31:45.4299820Z scale_ub: Optional[float], 2025-05-07T20:31:45.4299909Z contiguous: bool, 2025-05-07T20:31:45.4299993Z compiled: bool, 2025-05-07T20:31:45.4300080Z ) -> None: 2025-05-07T20:31:45.4304919Z torch.manual_seed(2025) 2025-05-07T20:31:45.4305010Z 2025-05-07T20:31:45.4305192Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4305273Z 2025-05-07T20:31:45.4305369Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4305497Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4305595Z x = x_sign * x_clamp 2025-05-07T20:31:45.4305678Z x0 = x[:, :D] 2025-05-07T20:31:45.4305767Z x1 = x[:, D:] 2025-05-07T20:31:45.4305841Z 2025-05-07T20:31:45.4305929Z if contiguous: 2025-05-07T20:31:45.4306032Z x0 = x0.contiguous() 2025-05-07T20:31:45.4306122Z x1 = x1.contiguous() 2025-05-07T20:31:45.4306195Z 2025-05-07T20:31:45.4306299Z if scale_ub is not None: 2025-05-07T20:31:45.4306407Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4306545Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4306630Z ) 2025-05-07T20:31:45.4306711Z else: 2025-05-07T20:31:45.4306807Z scale_ub_tensor = None 2025-05-07T20:31:45.4306889Z 2025-05-07T20:31:45.4307026Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4307121Z op = silu_mul_quant 2025-05-07T20:31:45.4307216Z if compiled: 2025-05-07T20:31:45.4307318Z op = torch.compile(op) 2025-05-07T20:31:45.4307434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4307506Z 2025-05-07T20:31:45.4307599Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4307603Z 2025-05-07T20:31:45.4307712Z moe/activation_test.py:117: 2025-05-07T20:31:45.4307844Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4307953Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4308062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4308436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4308542Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4309043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4309230Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4309596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4309820Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4310160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4310264Z kernel = self.compile( 2025-05-07T20:31:45.4310654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4310839Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4311082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4311087Z 2025-05-07T20:31:45.4311292Z self = 2025-05-07T20:31:45.4312083Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4312590Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6872f8c900>} 2025-05-07T20:31:45.4313351Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4313640Z context = 2025-05-07T20:31:45.4313649Z 2025-05-07T20:31:45.4313819Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4314082Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4314190Z module_map=module_map) 2025-05-07T20:31:45.4314357Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4314460Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4314537Z E ^ 2025-05-07T20:31:45.4314903Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4314908Z 2025-05-07T20:31:45.4315352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4315365Z 2025-05-07T20:31:45.4315496Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4315726Z self=, 2025-05-07T20:31:45.4315804Z T=1, 2025-05-07T20:31:45.4315888Z D=7168, 2025-05-07T20:31:45.4315971Z scale_ub=None, 2025-05-07T20:31:45.4316060Z contiguous=False, 2025-05-07T20:31:45.4316151Z compiled=True, 2025-05-07T20:31:45.4316226Z ) 2025-05-07T20:31:45.4316445Z self = 2025-05-07T20:31:45.4316615Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4316620Z 2025-05-07T20:31:45.4316698Z @given( 2025-05-07T20:31:45.4316825Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4316924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4317045Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4317169Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4317282Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4317358Z ) 2025-05-07T20:31:45.4317616Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4317710Z def test_silu_mul_quant( 2025-05-07T20:31:45.4317788Z self, 2025-05-07T20:31:45.4317872Z T: int, 2025-05-07T20:31:45.4317949Z D: int, 2025-05-07T20:31:45.4318057Z scale_ub: Optional[float], 2025-05-07T20:31:45.4318148Z contiguous: bool, 2025-05-07T20:31:45.4318234Z compiled: bool, 2025-05-07T20:31:45.4318318Z ) -> None: 2025-05-07T20:31:45.4318413Z torch.manual_seed(2025) 2025-05-07T20:31:45.4318488Z 2025-05-07T20:31:45.4318665Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4318739Z 2025-05-07T20:31:45.4318836Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4318966Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4319057Z x = x_sign * x_clamp 2025-05-07T20:31:45.4319138Z x0 = x[:, :D] 2025-05-07T20:31:45.4319228Z x1 = x[:, D:] 2025-05-07T20:31:45.4319382Z 2025-05-07T20:31:45.4319482Z if contiguous: 2025-05-07T20:31:45.4319574Z x0 = x0.contiguous() 2025-05-07T20:31:45.4319664Z x1 = x1.contiguous() 2025-05-07T20:31:45.4319746Z 2025-05-07T20:31:45.4319840Z if scale_ub is not None: 2025-05-07T20:31:45.4319948Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4320096Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4320172Z ) 2025-05-07T20:31:45.4320249Z else: 2025-05-07T20:31:45.4320353Z scale_ub_tensor = None 2025-05-07T20:31:45.4320428Z 2025-05-07T20:31:45.4320558Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4320733Z op = silu_mul_quant 2025-05-07T20:31:45.4320823Z if compiled: 2025-05-07T20:31:45.4320923Z op = torch.compile(op) 2025-05-07T20:31:45.4321037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4321115Z 2025-05-07T20:31:45.4321213Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4321336Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4321411Z 2025-05-07T20:31:45.4321555Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4321661Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4321762Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4321899Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4322038Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4322112Z 2025-05-07T20:31:45.4322221Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:45.4322225Z 2025-05-07T20:31:45.4322335Z moe/activation_test.py:126: 2025-05-07T20:31:45.4322472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4322579Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4322718Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4323291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4323396Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4323759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4323990Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4324360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4324625Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4325035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4325295Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4325686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4325854Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4326205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4326284Z fn() 2025-05-07T20:31:45.4326687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4326783Z self.fn.run( 2025-05-07T20:31:45.4327123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4327222Z kernel = self.compile( 2025-05-07T20:31:45.4327613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4327869Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4328007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4328011Z 2025-05-07T20:31:45.4328509Z self = 2025-05-07T20:31:45.4329371Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4329885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f68940f1080>} 2025-05-07T20:31:45.4330857Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4331057Z context = 2025-05-07T20:31:45.4331062Z 2025-05-07T20:31:45.4331228Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4331499Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4331608Z module_map=module_map) 2025-05-07T20:31:45.4331772Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4331884Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4331962Z E ^ 2025-05-07T20:31:45.4332321Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4332333Z 2025-05-07T20:31:45.4332761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4332765Z 2025-05-07T20:31:45.4332874Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4333108Z self=, 2025-05-07T20:31:45.4333186Z T=1, 2025-05-07T20:31:45.4333267Z D=5120, 2025-05-07T20:31:45.4333357Z scale_ub=1200.0, 2025-05-07T20:31:45.4333446Z contiguous=False, 2025-05-07T20:31:45.4333531Z compiled=True, 2025-05-07T20:31:45.4333614Z ) 2025-05-07T20:31:45.4333832Z self = 2025-05-07T20:31:45.4333998Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.4334010Z 2025-05-07T20:31:45.4334087Z @given( 2025-05-07T20:31:45.4334204Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4334319Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4334434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4334551Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4334674Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4334749Z ) 2025-05-07T20:31:45.4335003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4335107Z def test_silu_mul_quant( 2025-05-07T20:31:45.4335202Z self, 2025-05-07T20:31:45.4335285Z T: int, 2025-05-07T20:31:45.4335389Z D: int, 2025-05-07T20:31:45.4335493Z scale_ub: Optional[float], 2025-05-07T20:31:45.4335588Z contiguous: bool, 2025-05-07T20:31:45.4335674Z compiled: bool, 2025-05-07T20:31:45.4335754Z ) -> None: 2025-05-07T20:31:45.4335854Z torch.manual_seed(2025) 2025-05-07T20:31:45.4335928Z 2025-05-07T20:31:45.4336094Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4336180Z 2025-05-07T20:31:45.4336273Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4336397Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4336627Z x = x_sign * x_clamp 2025-05-07T20:31:45.4336710Z x0 = x[:, :D] 2025-05-07T20:31:45.4336791Z x1 = x[:, D:] 2025-05-07T20:31:45.4336872Z 2025-05-07T20:31:45.4336957Z if contiguous: 2025-05-07T20:31:45.4337050Z x0 = x0.contiguous() 2025-05-07T20:31:45.4337151Z x1 = x1.contiguous() 2025-05-07T20:31:45.4337225Z 2025-05-07T20:31:45.4337328Z if scale_ub is not None: 2025-05-07T20:31:45.4337434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4337568Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4337655Z ) 2025-05-07T20:31:45.4337732Z else: 2025-05-07T20:31:45.4337828Z scale_ub_tensor = None 2025-05-07T20:31:45.4337989Z 2025-05-07T20:31:45.4338121Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4338213Z op = silu_mul_quant 2025-05-07T20:31:45.4338308Z if compiled: 
2025-05-07T20:31:45.4338415Z op = torch.compile(op) 2025-05-07T20:31:45.4338522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4338602Z 2025-05-07T20:31:45.4338695Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4338699Z 2025-05-07T20:31:45.4338804Z moe/activation_test.py:117: 2025-05-07T20:31:45.4338936Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4339041Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4339149Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4339519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4339612Z return fn(*args, **kwargs) 2025-05-07T20:31:45.4340122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4340221Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4340590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4340812Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4341152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4341256Z kernel = self.compile( 2025-05-07T20:31:45.4341639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4341821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4341952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4341965Z 2025-05-07T20:31:45.4342169Z self = 2025-05-07T20:31:45.4342964Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4343472Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68940f3b00>} 2025-05-07T20:31:45.4344236Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4344425Z context = 2025-05-07T20:31:45.4344430Z 2025-05-07T20:31:45.4344597Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4344870Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4344977Z module_map=module_map) 2025-05-07T20:31:45.4345256Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4345375Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4345461Z E ^ 2025-05-07T20:31:45.4345850Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:45.4346384Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:45.4346609Z     self=<...>,
2025-05-07T20:31:45.4346685Z     T=1,
2025-05-07T20:31:45.4346846Z     D=5120,
2025-05-07T20:31:45.4346930Z     scale_ub=1200.0,
2025-05-07T20:31:45.4347023Z     contiguous=False,
2025-05-07T20:31:45.4347109Z     compiled=False,
2025-05-07T20:31:45.4347184Z )
2025-05-07T20:31:45.4347411Z self = <...>
2025-05-07T20:31:45.4347579Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:31:45.4347583Z 
2025-05-07T20:31:45.4347660Z     @given(
2025-05-07T20:31:45.4347790Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:45.4347891Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:45.4348004Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:45.4348128Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:45.4348244Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:45.4348324Z     )
2025-05-07T20:31:45.4348569Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:45.4348670Z     def test_silu_mul_quant(
2025-05-07T20:31:45.4348754Z         self,
2025-05-07T20:31:45.4348830Z         T: int,
2025-05-07T20:31:45.4348906Z         D: int,
2025-05-07T20:31:45.4349013Z         scale_ub: Optional[float],
2025-05-07T20:31:45.4349169Z         contiguous: bool,
2025-05-07T20:31:45.4349259Z         compiled: bool,
2025-05-07T20:31:45.4349345Z     ) -> None:
2025-05-07T20:31:45.4349441Z         torch.manual_seed(2025)
2025-05-07T20:31:45.4349514Z 
2025-05-07T20:31:45.4349689Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.4349761Z 
2025-05-07T20:31:45.4349859Z         x_sign = torch.sign(x)
2025-05-07T20:31:45.4349982Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.4350072Z         x = x_sign * x_clamp
2025-05-07T20:31:45.4350158Z         x0 = x[:, :D]
2025-05-07T20:31:45.4350237Z         x1 = x[:, D:]
2025-05-07T20:31:45.4350309Z 
2025-05-07T20:31:45.4350397Z         if contiguous:
2025-05-07T20:31:45.4350493Z             x0 = x0.contiguous()
2025-05-07T20:31:45.4350584Z             x1 = x1.contiguous()
2025-05-07T20:31:45.4350662Z 
2025-05-07T20:31:45.4350751Z         if scale_ub is not None:
2025-05-07T20:31:45.4350859Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:45.4351001Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:45.4351077Z             )
2025-05-07T20:31:45.4351159Z         else:
2025-05-07T20:31:45.4351251Z             scale_ub_tensor = None
2025-05-07T20:31:45.4351322Z 
2025-05-07T20:31:45.4351457Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.4351546Z             op = silu_mul_quant
2025-05-07T20:31:45.4351633Z             if compiled:
2025-05-07T20:31:45.4351738Z                 op = torch.compile(op)
2025-05-07T20:31:45.4351842Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.4351913Z 
2025-05-07T20:31:45.4352012Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:45.4352021Z 
2025-05-07T20:31:45.4352119Z moe/activation_test.py:117: 
2025-05-07T20:31:45.4352248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.4352358Z moe/activation_test.py:115: in fn
2025-05-07T20:31:45.4352542Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.4353049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:45.4353146Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:45.4353502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:45.4353729Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.4354068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:45.4354170Z     kernel = self.compile(
2025-05-07T20:31:45.4354623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:45.4354798Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.4354934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.4354938Z 
2025-05-07T20:31:45.4355141Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:31:45.4355925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:45.4356435Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f6873025a80>}
2025-05-07T20:31:45.4357194Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:31:45.4357394Z context = <...>
2025-05-07T20:31:45.4357399Z 
2025-05-07T20:31:45.4357567Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:45.4357835Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:45.4357941Z                            module_map=module_map)
2025-05-07T20:31:45.4358102Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.4358207Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.4358287Z E   ^
2025-05-07T20:31:45.4358645Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4358650Z 
2025-05-07T20:31:45.4359073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.4359083Z 
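For reference while reading the test source above: silu_mul_quant fuses a SiLU gate with an elementwise product and rowwise FP8 quantization, returning the quantized activations and their per-row scales. A minimal eager sketch of those semantics, inferred from the test's inputs and its (y_fp8, y_scale) unpacking; the scale convention, the eps clamp, and the use of scale_ub as a cap on the row maximum are assumptions, not FBGEMM's kernel:

from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU-gated product, computed in fp32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Rowwise dynamic quantization to FP8 E4M3.
    row_max = y.abs().amax(dim=-1, keepdim=True)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    y_scale = row_max.clamp(min=1e-12) / fp8_max
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale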
2025-05-07T20:31:45.4359187Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.4372521Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.4386147Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:45.4399085Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:45.4411917Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:45.4425220Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:45.4444514Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:45.4457568Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.4471044Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:45.4483911Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:45.4496591Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.4508874Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.4508973Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.4509171Z E   ^
2025-05-07T20:31:45.4509530Z E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4509535Z 2025-05-07T20:31:45.4509958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4509969Z 2025-05-07T20:31:45.4510073Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4510296Z self=, 2025-05-07T20:31:45.4510379Z T=1, 2025-05-07T20:31:45.4510457Z D=7168, 2025-05-07T20:31:45.4510544Z scale_ub=None, 2025-05-07T20:31:45.4510638Z contiguous=False, 2025-05-07T20:31:45.4510723Z compiled=False, 2025-05-07T20:31:45.4510795Z ) 2025-05-07T20:31:45.4511023Z self = 2025-05-07T20:31:45.4511188Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.4511192Z 2025-05-07T20:31:45.4511276Z @given( 2025-05-07T20:31:45.4511393Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4511491Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4511612Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4511728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4511847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4511926Z ) 2025-05-07T20:31:45.4512171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4512269Z def test_silu_mul_quant( 2025-05-07T20:31:45.4512352Z self, 2025-05-07T20:31:45.4512429Z T: int, 2025-05-07T20:31:45.4512515Z D: int, 2025-05-07T20:31:45.4512613Z scale_ub: Optional[float], 2025-05-07T20:31:45.4512702Z contiguous: bool, 2025-05-07T20:31:45.4512794Z compiled: bool, 2025-05-07T20:31:45.4512872Z ) -> None: 2025-05-07T20:31:45.4512965Z torch.manual_seed(2025) 2025-05-07T20:31:45.4513049Z 2025-05-07T20:31:45.4513218Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4513293Z 2025-05-07T20:31:45.4513391Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4513515Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4513608Z x = x_sign * x_clamp 2025-05-07T20:31:45.4513694Z x0 = x[:, :D] 2025-05-07T20:31:45.4513774Z x1 = x[:, D:] 2025-05-07T20:31:45.4513845Z 2025-05-07T20:31:45.4513937Z if contiguous: 2025-05-07T20:31:45.4514114Z x0 = x0.contiguous() 2025-05-07T20:31:45.4514210Z x1 = x1.contiguous() 2025-05-07T20:31:45.4514282Z 2025-05-07T20:31:45.4514371Z if scale_ub is not None: 2025-05-07T20:31:45.4514481Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4514614Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4514689Z ) 2025-05-07T20:31:45.4514772Z else: 2025-05-07T20:31:45.4514863Z scale_ub_tensor = None 2025-05-07T20:31:45.4514936Z 2025-05-07T20:31:45.4515072Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4515161Z op = silu_mul_quant 2025-05-07T20:31:45.4515246Z if compiled: 2025-05-07T20:31:45.4515432Z op = torch.compile(op) 2025-05-07T20:31:45.4515536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4515615Z 2025-05-07T20:31:45.4515705Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4515710Z 2025-05-07T20:31:45.4515812Z moe/activation_test.py:117: 2025-05-07T20:31:45.4515946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4516047Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4516144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4516649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4516745Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4517106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4517326Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4517670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4517769Z kernel = self.compile( 2025-05-07T20:31:45.4518154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4518325Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4518456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4518461Z 2025-05-07T20:31:45.4518661Z self = 2025-05-07T20:31:45.4519445Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4519947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68723fdf80>} 2025-05-07T20:31:45.4520714Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4520902Z context = 2025-05-07T20:31:45.4520908Z 2025-05-07T20:31:45.4521071Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4521338Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4521444Z module_map=module_map) 2025-05-07T20:31:45.4521615Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4521714Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4521790Z E ^ 2025-05-07T20:31:45.4522158Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4522163Z 2025-05-07T20:31:45.4522663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4522668Z 2025-05-07T20:31:45.4522772Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4523002Z self=, 2025-05-07T20:31:45.4523081Z T=2048, 2025-05-07T20:31:45.4523164Z D=7168, 2025-05-07T20:31:45.4523245Z scale_ub=None, 2025-05-07T20:31:45.4523333Z contiguous=False, 2025-05-07T20:31:45.4523421Z compiled=True, 2025-05-07T20:31:45.4523494Z ) 2025-05-07T20:31:45.4523713Z self = 2025-05-07T20:31:45.4523893Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4523977Z 2025-05-07T20:31:45.4524057Z @given( 2025-05-07T20:31:45.4524174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4524281Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4524397Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4524526Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4524639Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4524713Z ) 2025-05-07T20:31:45.4524970Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4525068Z def test_silu_mul_quant( 2025-05-07T20:31:45.4525168Z self, 2025-05-07T20:31:45.4525261Z T: int, 2025-05-07T20:31:45.4525356Z D: int, 2025-05-07T20:31:45.4525455Z scale_ub: Optional[float], 2025-05-07T20:31:45.4525550Z contiguous: bool, 2025-05-07T20:31:45.4525636Z compiled: bool, 2025-05-07T20:31:45.4525714Z ) -> None: 2025-05-07T20:31:45.4525817Z torch.manual_seed(2025) 2025-05-07T20:31:45.4525895Z 2025-05-07T20:31:45.4526070Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4526144Z 2025-05-07T20:31:45.4526238Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4526373Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4526460Z x = x_sign * x_clamp 2025-05-07T20:31:45.4526540Z x0 = x[:, :D] 2025-05-07T20:31:45.4526628Z x1 = x[:, D:] 2025-05-07T20:31:45.4526704Z 2025-05-07T20:31:45.4526789Z if contiguous: 2025-05-07T20:31:45.4526889Z x0 = x0.contiguous() 2025-05-07T20:31:45.4526979Z x1 = x1.contiguous() 2025-05-07T20:31:45.4527050Z 2025-05-07T20:31:45.4527149Z if scale_ub is not None: 2025-05-07T20:31:45.4527256Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4527398Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4527476Z ) 2025-05-07T20:31:45.4527559Z else: 2025-05-07T20:31:45.4527659Z scale_ub_tensor = None 2025-05-07T20:31:45.4527731Z 2025-05-07T20:31:45.4527858Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4527953Z op = silu_mul_quant 2025-05-07T20:31:45.4528041Z if compiled: 2025-05-07T20:31:45.4528477Z op = torch.compile(op) 2025-05-07T20:31:45.4528651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4528759Z 2025-05-07T20:31:45.4528860Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4528865Z 2025-05-07T20:31:45.4528969Z moe/activation_test.py:117: 2025-05-07T20:31:45.4529099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4529204Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4529306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4529673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4529777Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4530271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4530595Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4530959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4531181Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4531525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4531618Z kernel = self.compile( 2025-05-07T20:31:45.4531998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4532179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4532467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4532472Z 2025-05-07T20:31:45.4532686Z self = 2025-05-07T20:31:45.4533470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4533971Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68723ff420>} 2025-05-07T20:31:45.4534728Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4534916Z context = 2025-05-07T20:31:45.4534927Z 2025-05-07T20:31:45.4535100Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4535366Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4535489Z module_map=module_map) 2025-05-07T20:31:45.4535681Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4535791Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4535874Z E ^ 2025-05-07T20:31:45.4536230Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4536235Z 2025-05-07T20:31:45.4536649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4536654Z 2025-05-07T20:31:45.4536762Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4536987Z self=, 2025-05-07T20:31:45.4537070Z T=4096, 2025-05-07T20:31:45.4537153Z D=7168, 2025-05-07T20:31:45.4537238Z scale_ub=None, 2025-05-07T20:31:45.4537330Z contiguous=False, 2025-05-07T20:31:45.4537412Z compiled=True, 2025-05-07T20:31:45.4537490Z ) 2025-05-07T20:31:45.4537714Z self = 2025-05-07T20:31:45.4537885Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4537889Z 2025-05-07T20:31:45.4537966Z @given( 2025-05-07T20:31:45.4538089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4538188Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4538303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4538426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4538538Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4538617Z ) 2025-05-07T20:31:45.4538866Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4538960Z def test_silu_mul_quant( 2025-05-07T20:31:45.4539042Z self, 2025-05-07T20:31:45.4539119Z T: int, 2025-05-07T20:31:45.4539285Z D: int, 2025-05-07T20:31:45.4539392Z scale_ub: Optional[float], 2025-05-07T20:31:45.4539480Z contiguous: bool, 2025-05-07T20:31:45.4539565Z compiled: bool, 2025-05-07T20:31:45.4539651Z ) -> None: 2025-05-07T20:31:45.4539745Z torch.manual_seed(2025) 2025-05-07T20:31:45.4539820Z 2025-05-07T20:31:45.4539994Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4540069Z 2025-05-07T20:31:45.4540171Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4540295Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4540383Z x = x_sign * x_clamp 2025-05-07T20:31:45.4540471Z x0 = x[:, :D] 2025-05-07T20:31:45.4540632Z x1 = x[:, D:] 2025-05-07T20:31:45.4540704Z 2025-05-07T20:31:45.4540796Z if contiguous: 2025-05-07T20:31:45.4540888Z x0 = x0.contiguous() 2025-05-07T20:31:45.4540977Z x1 = x1.contiguous() 2025-05-07T20:31:45.4541057Z 2025-05-07T20:31:45.4541153Z if scale_ub is not None: 2025-05-07T20:31:45.4541259Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4541400Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4541475Z ) 2025-05-07T20:31:45.4541552Z else: 2025-05-07T20:31:45.4541652Z scale_ub_tensor = None 2025-05-07T20:31:45.4541724Z 2025-05-07T20:31:45.4541860Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4541951Z op = silu_mul_quant 2025-05-07T20:31:45.4542035Z if compiled: 2025-05-07T20:31:45.4542142Z op = torch.compile(op) 2025-05-07T20:31:45.4542248Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4542326Z 2025-05-07T20:31:45.4542424Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4542428Z 2025-05-07T20:31:45.4542526Z moe/activation_test.py:117: 2025-05-07T20:31:45.4542660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4542770Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4542870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4543242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4543334Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4543827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4543929Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4544284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4544509Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4544851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4544944Z kernel = self.compile( 2025-05-07T20:31:45.4545337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4545510Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4545637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4545641Z 2025-05-07T20:31:45.4545848Z self = 2025-05-07T20:31:45.4546625Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4547139Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727e8680>} 2025-05-07T20:31:45.4547979Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4548175Z context = 2025-05-07T20:31:45.4548180Z 2025-05-07T20:31:45.4548346Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4548607Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4548719Z module_map=module_map) 2025-05-07T20:31:45.4548879Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4548977Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4549317Z E ^ 2025-05-07T20:31:45.4549676Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4549681Z 2025-05-07T20:31:45.4550107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4550111Z 2025-05-07T20:31:45.4550214Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4550439Z self=, 2025-05-07T20:31:45.4550522Z T=16384, 2025-05-07T20:31:45.4550599Z D=5120, 2025-05-07T20:31:45.4550683Z scale_ub=1200.0, 2025-05-07T20:31:45.4550775Z contiguous=False, 2025-05-07T20:31:45.4550859Z compiled=False, 2025-05-07T20:31:45.4550938Z ) 2025-05-07T20:31:45.4551156Z self = 2025-05-07T20:31:45.4551335Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.4551345Z 2025-05-07T20:31:45.4551427Z @given( 2025-05-07T20:31:45.4551546Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4551645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4551772Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4551888Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4551999Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4552078Z ) 2025-05-07T20:31:45.4552322Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4552422Z def test_silu_mul_quant( 2025-05-07T20:31:45.4552498Z self, 2025-05-07T20:31:45.4552574Z T: int, 2025-05-07T20:31:45.4552655Z D: int, 2025-05-07T20:31:45.4552752Z scale_ub: Optional[float], 2025-05-07T20:31:45.4552842Z contiguous: bool, 2025-05-07T20:31:45.4552933Z compiled: bool, 2025-05-07T20:31:45.4553016Z ) -> None: 2025-05-07T20:31:45.4553110Z torch.manual_seed(2025) 2025-05-07T20:31:45.4553189Z 2025-05-07T20:31:45.4553358Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4553431Z 2025-05-07T20:31:45.4553542Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4558144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4558255Z x = x_sign * x_clamp 2025-05-07T20:31:45.4558343Z x0 = x[:, :D] 2025-05-07T20:31:45.4558436Z x1 = x[:, D:] 2025-05-07T20:31:45.4558511Z 2025-05-07T20:31:45.4558599Z if contiguous: 2025-05-07T20:31:45.4558701Z x0 = x0.contiguous() 2025-05-07T20:31:45.4558793Z x1 = x1.contiguous() 2025-05-07T20:31:45.4558866Z 2025-05-07T20:31:45.4558967Z if scale_ub is not None: 2025-05-07T20:31:45.4559078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4559228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4559314Z ) 2025-05-07T20:31:45.4559393Z else: 2025-05-07T20:31:45.4559496Z scale_ub_tensor = None 2025-05-07T20:31:45.4559570Z 2025-05-07T20:31:45.4559818Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4559920Z op = silu_mul_quant 2025-05-07T20:31:45.4560008Z if compiled: 2025-05-07T20:31:45.4560111Z op = torch.compile(op) 2025-05-07T20:31:45.4560228Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4560302Z 2025-05-07T20:31:45.4560395Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4560409Z 2025-05-07T20:31:45.4560511Z moe/activation_test.py:117: 2025-05-07T20:31:45.4560642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4560752Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4560854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4561362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:45.4561546Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4561911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4562142Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4562487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4562583Z kernel = self.compile( 2025-05-07T20:31:45.4562977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4563151Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4563280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4563285Z 2025-05-07T20:31:45.4563508Z self = 2025-05-07T20:31:45.4564293Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4564806Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727e94e0>} 2025-05-07T20:31:45.4565583Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4565804Z context = 2025-05-07T20:31:45.4565808Z 2025-05-07T20:31:45.4565974Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4566242Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4566360Z module_map=module_map) 2025-05-07T20:31:45.4566530Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4566632Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4566719Z E ^ 2025-05-07T20:31:45.4567081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4567085Z 2025-05-07T20:31:45.4567509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4567513Z 2025-05-07T20:31:45.4567619Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4567843Z self=, 2025-05-07T20:31:45.4567931Z T=16384, 2025-05-07T20:31:45.4568010Z D=5120, 2025-05-07T20:31:45.4568100Z scale_ub=1200.0, 2025-05-07T20:31:45.4568195Z contiguous=True, 2025-05-07T20:31:45.4568279Z compiled=True, 2025-05-07T20:31:45.4568359Z ) 2025-05-07T20:31:45.4568666Z self = 2025-05-07T20:31:45.4568845Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.4568850Z 2025-05-07T20:31:45.4568936Z @given( 2025-05-07T20:31:45.4569057Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4569158Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4569280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4569398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4569513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4569594Z ) 2025-05-07T20:31:45.4569840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4570056Z def test_silu_mul_quant( 2025-05-07T20:31:45.4570139Z self, 2025-05-07T20:31:45.4570218Z T: int, 2025-05-07T20:31:45.4570302Z D: int, 2025-05-07T20:31:45.4570399Z scale_ub: Optional[float], 2025-05-07T20:31:45.4570495Z contiguous: bool, 2025-05-07T20:31:45.4570588Z compiled: bool, 2025-05-07T20:31:45.4570667Z ) -> None: 2025-05-07T20:31:45.4570764Z torch.manual_seed(2025) 2025-05-07T20:31:45.4570846Z 2025-05-07T20:31:45.4571017Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4571097Z 2025-05-07T20:31:45.4571197Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4571323Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4571422Z x = x_sign * x_clamp 2025-05-07T20:31:45.4571508Z x0 = x[:, :D] 2025-05-07T20:31:45.4571588Z x1 = x[:, D:] 2025-05-07T20:31:45.4571668Z 2025-05-07T20:31:45.4571753Z if contiguous: 2025-05-07T20:31:45.4571852Z x0 = x0.contiguous() 2025-05-07T20:31:45.4571948Z x1 = x1.contiguous() 2025-05-07T20:31:45.4572020Z 2025-05-07T20:31:45.4572112Z if scale_ub is not None: 2025-05-07T20:31:45.4572225Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4572366Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4572443Z ) 2025-05-07T20:31:45.4572528Z else: 2025-05-07T20:31:45.4572622Z scale_ub_tensor = None 2025-05-07T20:31:45.4572708Z 2025-05-07T20:31:45.4572838Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4572929Z op = silu_mul_quant 2025-05-07T20:31:45.4573027Z if compiled: 2025-05-07T20:31:45.4573130Z op = torch.compile(op) 2025-05-07T20:31:45.4573239Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4573323Z 2025-05-07T20:31:45.4573419Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4573423Z 2025-05-07T20:31:45.4573531Z moe/activation_test.py:117: 2025-05-07T20:31:45.4573672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4573775Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4573880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4574260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4574354Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4574866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4574966Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4575326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4575559Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4575903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4576011Z kernel = self.compile( 2025-05-07T20:31:45.4576485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4576662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4576803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4576807Z 2025-05-07T20:31:45.4577012Z self = 2025-05-07T20:31:45.4577804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4578310Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727ea8e0>} 2025-05-07T20:31:45.4579144Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4579343Z context = 2025-05-07T20:31:45.4579348Z 2025-05-07T20:31:45.4579513Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4579784Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4579893Z module_map=module_map) 2025-05-07T20:31:45.4580056Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4580164Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4580243Z E ^ 2025-05-07T20:31:45.4580601Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4580623Z 2025-05-07T20:31:45.4581042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4581050Z 2025-05-07T20:31:45.4581159Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4581393Z self=, 2025-05-07T20:31:45.4581471Z T=16384, 2025-05-07T20:31:45.4581548Z D=5120, 2025-05-07T20:31:45.4581638Z scale_ub=None, 2025-05-07T20:31:45.4581730Z contiguous=False, 2025-05-07T20:31:45.4581814Z compiled=True, 2025-05-07T20:31:45.4581895Z ) 2025-05-07T20:31:45.4582115Z self = 2025-05-07T20:31:45.4582299Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4582303Z 2025-05-07T20:31:45.4582381Z @given( 2025-05-07T20:31:45.4582505Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4582611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4582727Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4582849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4582971Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4583046Z ) 2025-05-07T20:31:45.4583292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4583395Z def test_silu_mul_quant( 2025-05-07T20:31:45.4583473Z self, 2025-05-07T20:31:45.4583557Z T: int, 2025-05-07T20:31:45.4583634Z D: int, 2025-05-07T20:31:45.4583734Z scale_ub: Optional[float], 2025-05-07T20:31:45.4583831Z contiguous: bool, 2025-05-07T20:31:45.4583917Z compiled: bool, 2025-05-07T20:31:45.4583999Z ) -> None: 2025-05-07T20:31:45.4584104Z torch.manual_seed(2025) 2025-05-07T20:31:45.4584183Z 2025-05-07T20:31:45.4584352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4584434Z 2025-05-07T20:31:45.4584529Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4584656Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4584848Z x = x_sign * x_clamp 2025-05-07T20:31:45.4584951Z x0 = x[:, :D] 2025-05-07T20:31:45.4585061Z x1 = x[:, D:] 2025-05-07T20:31:45.4585153Z 2025-05-07T20:31:45.4585258Z if contiguous: 2025-05-07T20:31:45.4585381Z x0 = x0.contiguous() 2025-05-07T20:31:45.4585493Z x1 = x1.contiguous() 2025-05-07T20:31:45.4585584Z 2025-05-07T20:31:45.4585702Z if scale_ub is not None: 2025-05-07T20:31:45.4585834Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4586003Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4586107Z ) 2025-05-07T20:31:45.4586202Z else: 2025-05-07T20:31:45.4586413Z scale_ub_tensor = None 2025-05-07T20:31:45.4586513Z 2025-05-07T20:31:45.4586673Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4586794Z op = silu_mul_quant 2025-05-07T20:31:45.4586900Z if compiled: 2025-05-07T20:31:45.4587030Z op = torch.compile(op) 2025-05-07T20:31:45.4587167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4587257Z 2025-05-07T20:31:45.4587364Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4587369Z 2025-05-07T20:31:45.4587474Z moe/activation_test.py:117: 2025-05-07T20:31:45.4587604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4587708Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4587816Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4588184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4588284Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4588785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4588883Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4589392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4589614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4589954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4590056Z kernel = self.compile( 2025-05-07T20:31:45.4590442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4590622Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4590751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4590762Z 2025-05-07T20:31:45.4590966Z self = 2025-05-07T20:31:45.4591761Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4592266Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727eaf20>} 2025-05-07T20:31:45.4593027Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4593217Z context = 2025-05-07T20:31:45.4593222Z 2025-05-07T20:31:45.4593397Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4593659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4593854Z module_map=module_map) 2025-05-07T20:31:45.4594025Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4594127Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4594204Z E ^ 2025-05-07T20:31:45.4594573Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4594577Z 2025-05-07T20:31:45.4595004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4595010Z 2025-05-07T20:31:45.4595124Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4595351Z self=, 2025-05-07T20:31:45.4595516Z T=2048, 2025-05-07T20:31:45.4595609Z D=5120, 2025-05-07T20:31:45.4595711Z scale_ub=None, 2025-05-07T20:31:45.4595807Z contiguous=False, 2025-05-07T20:31:45.4595917Z compiled=True, 2025-05-07T20:31:45.4595991Z ) 2025-05-07T20:31:45.4596219Z self = 2025-05-07T20:31:45.4596403Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4596407Z 2025-05-07T20:31:45.4596487Z @given( 2025-05-07T20:31:45.4596613Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4596713Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4596832Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4596962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4597076Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4597152Z ) 2025-05-07T20:31:45.4597404Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4597505Z def test_silu_mul_quant( 2025-05-07T20:31:45.4597585Z self, 2025-05-07T20:31:45.4597668Z T: int, 2025-05-07T20:31:45.4597745Z D: int, 2025-05-07T20:31:45.4597858Z scale_ub: Optional[float], 2025-05-07T20:31:45.4597948Z contiguous: bool, 2025-05-07T20:31:45.4598035Z compiled: bool, 2025-05-07T20:31:45.4598124Z ) -> None: 2025-05-07T20:31:45.4598219Z torch.manual_seed(2025) 2025-05-07T20:31:45.4598298Z 2025-05-07T20:31:45.4598466Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4598542Z 2025-05-07T20:31:45.4598643Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4598768Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4598860Z x = x_sign * x_clamp 2025-05-07T20:31:45.4598948Z x0 = x[:, :D] 2025-05-07T20:31:45.4599028Z x1 = x[:, D:] 2025-05-07T20:31:45.4599104Z 2025-05-07T20:31:45.4599198Z if contiguous: 2025-05-07T20:31:45.4599289Z x0 = x0.contiguous() 2025-05-07T20:31:45.4599378Z x1 = x1.contiguous() 2025-05-07T20:31:45.4599460Z 2025-05-07T20:31:45.4599550Z if scale_ub is not None: 2025-05-07T20:31:45.4599665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4599800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4599875Z ) 2025-05-07T20:31:45.4599957Z else: 2025-05-07T20:31:45.4600052Z scale_ub_tensor = None 2025-05-07T20:31:45.4600126Z 2025-05-07T20:31:45.4600262Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4600352Z op = silu_mul_quant 2025-05-07T20:31:45.4600436Z if compiled: 2025-05-07T20:31:45.4600541Z op = torch.compile(op) 2025-05-07T20:31:45.4600647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4600719Z 2025-05-07T20:31:45.4600819Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4600824Z 2025-05-07T20:31:45.4600922Z moe/activation_test.py:117: 2025-05-07T20:31:45.4601057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4601280Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4601380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4601756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4601848Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4602340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4602442Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4602801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4603028Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4603444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4603540Z kernel = self.compile( 2025-05-07T20:31:45.4603933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4604104Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4604232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4604242Z 2025-05-07T20:31:45.4604445Z self = 2025-05-07T20:31:45.4605227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4605741Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871f58d60>} 2025-05-07T20:31:45.4606502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4606695Z context = 2025-05-07T20:31:45.4606700Z 2025-05-07T20:31:45.4606862Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4607122Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4607235Z module_map=module_map) 2025-05-07T20:31:45.4607399Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4607503Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4607580Z E ^ 2025-05-07T20:31:45.4607942Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4607946Z 2025-05-07T20:31:45.4608374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4608378Z 2025-05-07T20:31:45.4608482Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4608705Z self=, 2025-05-07T20:31:45.4608789Z T=2048, 2025-05-07T20:31:45.4608865Z D=5120, 2025-05-07T20:31:45.4608953Z scale_ub=1200.0, 2025-05-07T20:31:45.4609037Z contiguous=False, 2025-05-07T20:31:45.4609118Z compiled=True, 2025-05-07T20:31:45.4609201Z ) 2025-05-07T20:31:45.4609419Z self = 2025-05-07T20:31:45.4609591Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.4609600Z 2025-05-07T20:31:45.4609685Z @given( 2025-05-07T20:31:45.4609803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4609903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4610177Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4610296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4610415Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4610488Z ) 2025-05-07T20:31:45.4610734Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4610834Z def test_silu_mul_quant( 2025-05-07T20:31:45.4610911Z self, 2025-05-07T20:31:45.4610989Z T: int, 2025-05-07T20:31:45.4611074Z D: int, 2025-05-07T20:31:45.4611171Z scale_ub: Optional[float], 2025-05-07T20:31:45.4611259Z contiguous: bool, 2025-05-07T20:31:45.4611350Z compiled: bool, 2025-05-07T20:31:45.4611428Z ) -> None: 2025-05-07T20:31:45.4611607Z torch.manual_seed(2025) 2025-05-07T20:31:45.4611686Z 2025-05-07T20:31:45.4611852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4611932Z 2025-05-07T20:31:45.4612024Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4612155Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4612253Z x = x_sign * x_clamp 2025-05-07T20:31:45.4612333Z x0 = x[:, :D] 2025-05-07T20:31:45.4612414Z x1 = x[:, D:] 2025-05-07T20:31:45.4612493Z 2025-05-07T20:31:45.4612577Z if contiguous: 2025-05-07T20:31:45.4612668Z x0 = x0.contiguous() 2025-05-07T20:31:45.4612765Z x1 = x1.contiguous() 2025-05-07T20:31:45.4612837Z 2025-05-07T20:31:45.4612926Z if scale_ub is not None: 2025-05-07T20:31:45.4613037Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4613172Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4613253Z ) 2025-05-07T20:31:45.4613336Z else: 2025-05-07T20:31:45.4613430Z scale_ub_tensor = None 2025-05-07T20:31:45.4613508Z 2025-05-07T20:31:45.4613636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4613729Z op = silu_mul_quant 2025-05-07T20:31:45.4613823Z if compiled: 2025-05-07T20:31:45.4613925Z op = torch.compile(op) 2025-05-07T20:31:45.4614030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4614108Z 2025-05-07T20:31:45.4614199Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4614204Z 2025-05-07T20:31:45.4614301Z moe/activation_test.py:117: 2025-05-07T20:31:45.4614438Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4614540Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4614645Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4615017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4615135Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4615665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4615768Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4616124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4616355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4616693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4616794Z kernel = self.compile( 2025-05-07T20:31:45.4617174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4617345Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4617483Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4617487Z 2025-05-07T20:31:45.4617690Z self = 2025-05-07T20:31:45.4618569Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4619075Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871f59760>} 2025-05-07T20:31:45.4619825Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4620017Z context = 2025-05-07T20:31:45.4620097Z 2025-05-07T20:31:45.4620261Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4620529Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4620638Z module_map=module_map) 2025-05-07T20:31:45.4620817Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4620917Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4620995Z E ^ 2025-05-07T20:31:45.4621359Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4621364Z 2025-05-07T20:31:45.4621782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4621786Z 2025-05-07T20:31:45.4621897Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4622121Z self=, 2025-05-07T20:31:45.4622203Z T=4096, 2025-05-07T20:31:45.4622287Z D=5120, 2025-05-07T20:31:45.4622370Z scale_ub=1200.0, 2025-05-07T20:31:45.4622455Z contiguous=True, 2025-05-07T20:31:45.4622548Z compiled=True, 2025-05-07T20:31:45.4622620Z ) 2025-05-07T20:31:45.4622843Z self = 2025-05-07T20:31:45.4623013Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.4623017Z 2025-05-07T20:31:45.4623095Z @given( 2025-05-07T20:31:45.4623217Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4623317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4623430Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4623558Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4623671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4623750Z ) 2025-05-07T20:31:45.4624001Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4624095Z def test_silu_mul_quant( 2025-05-07T20:31:45.4624181Z self, 2025-05-07T20:31:45.4624262Z T: int, 2025-05-07T20:31:45.4624342Z D: int, 2025-05-07T20:31:45.4624446Z scale_ub: Optional[float], 2025-05-07T20:31:45.4624540Z contiguous: bool, 2025-05-07T20:31:45.4624625Z compiled: bool, 2025-05-07T20:31:45.4624720Z ) -> None: 2025-05-07T20:31:45.4624838Z torch.manual_seed(2025) 2025-05-07T20:31:45.4624931Z 2025-05-07T20:31:45.4625146Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4625237Z 2025-05-07T20:31:45.4625350Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4625509Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4625620Z x = x_sign * x_clamp 2025-05-07T20:31:45.4625724Z x0 = x[:, :D] 2025-05-07T20:31:45.4625827Z x1 = x[:, D:] 2025-05-07T20:31:45.4625919Z 2025-05-07T20:31:45.4626028Z if contiguous: 2025-05-07T20:31:45.4626141Z x0 = x0.contiguous() 2025-05-07T20:31:45.4626251Z x1 = x1.contiguous() 2025-05-07T20:31:45.4626345Z 2025-05-07T20:31:45.4626566Z if scale_ub is not None: 2025-05-07T20:31:45.4626698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4626847Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4626921Z ) 2025-05-07T20:31:45.4626998Z else: 2025-05-07T20:31:45.4627095Z scale_ub_tensor = None 2025-05-07T20:31:45.4627167Z 2025-05-07T20:31:45.4627297Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4627392Z op = silu_mul_quant 2025-05-07T20:31:45.4627476Z if compiled: 2025-05-07T20:31:45.4627580Z op = torch.compile(op) 2025-05-07T20:31:45.4627685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4627834Z 2025-05-07T20:31:45.4627931Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4627936Z 2025-05-07T20:31:45.4628032Z moe/activation_test.py:117: 2025-05-07T20:31:45.4628476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4628641Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4628781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4629210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4629303Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4629802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:45.4629905Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:45.4630266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:45.4630496Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.4630844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:45.4630942Z     kernel = self.compile(
2025-05-07T20:31:45.4631331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:45.4631505Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.4631633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:45.4631848Z self =
2025-05-07T20:31:45.4632634Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:45.4633147Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871f5a980>}
2025-05-07T20:31:45.4633903Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:45.4634092Z context =
2025-05-07T20:31:45.4634265Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:45.4634528Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:45.4634640Z                           module_map=module_map)
2025-05-07T20:31:45.4634803Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.4634906Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.4634995Z E       ^
2025-05-07T20:31:45.4635399Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4636111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.4636222Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:45.4636447Z     self=,
2025-05-07T20:31:45.4636532Z     T=128,
2025-05-07T20:31:45.4636607Z     D=5120,
2025-05-07T20:31:45.4636689Z     scale_ub=1200.0,
2025-05-07T20:31:45.4636786Z     contiguous=False,
2025-05-07T20:31:45.4636869Z     compiled=True,
2025-05-07T20:31:45.4636942Z )
2025-05-07T20:31:45.4637340Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:31:45.4637559Z     @given(
2025-05-07T20:31:45.4637680Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:45.4637779Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:45.4637909Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:45.4638026Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:45.4638138Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:45.4638219Z     )
2025-05-07T20:31:45.4638464Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:45.4638566Z     def test_silu_mul_quant(
2025-05-07T20:31:45.4638642Z         self,
2025-05-07T20:31:45.4638718Z         T: int,
2025-05-07T20:31:45.4638803Z         D: int,
2025-05-07T20:31:45.4638901Z         scale_ub: Optional[float],
2025-05-07T20:31:45.4638992Z         contiguous: bool,
2025-05-07T20:31:45.4639083Z         compiled: bool,
2025-05-07T20:31:45.4639168Z     ) -> None:
2025-05-07T20:31:45.4639263Z         torch.manual_seed(2025)
2025-05-07T20:31:45.4639510Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.4639686Z         x_sign = torch.sign(x)
2025-05-07T20:31:45.4639810Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.4639900Z         x = x_sign * x_clamp
2025-05-07T20:31:45.4639988Z         x0 = x[:, :D]
2025-05-07T20:31:45.4640068Z         x1 = x[:, D:]
2025-05-07T20:31:45.4640232Z         if contiguous:
2025-05-07T20:31:45.4640324Z             x0 = x0.contiguous()
2025-05-07T20:31:45.4640419Z             x1 = x1.contiguous()
2025-05-07T20:31:45.4640580Z         if scale_ub is not None:
2025-05-07T20:31:45.4640693Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:45.4640827Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:45.4640907Z             )
2025-05-07T20:31:45.4640989Z         else:
2025-05-07T20:31:45.4641084Z             scale_ub_tensor = None
2025-05-07T20:31:45.4641291Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.4641381Z             op = silu_mul_quant
2025-05-07T20:31:45.4641473Z             if compiled:
2025-05-07T20:31:45.4641571Z                 op = torch.compile(op)
2025-05-07T20:31:45.4641675Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.4641842Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:45.4641943Z moe/activation_test.py:117:
2025-05-07T20:31:45.4642079Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:45.4642179Z moe/activation_test.py:115: in fn
2025-05-07T20:31:45.4642278Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.4642652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:45.4642749Z     return fn(*args, **kwargs)
2025-05-07T20:31:45.4643339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:45.4643439Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:45.4648454Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.4648560Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.4648636Z E   ^
2025-05-07T20:31:45.4648992Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4649420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
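Note: the test runs Hypothesis with verbosity=Verbosity.verbose, so it logs a "Trying example: ..." block for every parameter combination it generates or shrinks and keeps drawing examples after a failure, which is why the identical error repeats for each combination below. While debugging, one can pin a single failing combination; a minimal stand-alone sketch, using only the strategies visible in the test source above (test_shapes_only is a hypothetical test, not FBGEMM's):

    from hypothesis import Verbosity, example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=128, D=5120)  # force the first failing shape seen in this log
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_shapes_only(T: int, D: int) -> None:
        assert T >= 1 and D in (5120, 7168)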
2025-05-07T20:31:45.4649526Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:45.4662285Z E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4663481Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:31:45.4675189Z E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4676304Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:45.4692742Z E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4693844Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.4706234Z E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4707408Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.4719860Z E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4720962Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:45.4734001Z E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
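Note: every CompilationError in this run is the same architecture mismatch: Triton rejects the fp8e4nv (e4m3) dtype on this runner's GPU. The linux.g5.4xlarge runner carries an A10G, which reports compute capability (8, 6), and fp8e4nv is generally only available from (8, 9) onward; that cutoff is an assumption based on the error text, not something stated in the test source. A minimal sketch of a capability guard that would skip these examples instead of failing them:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumed requirement: fp8e4nv needs sm_89 or newer; the A10G is sm_86.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage: apply to a test method or TestCase class.
    skip_if_no_fp8 = unittest.skipIf(not supports_fp8e4nv(), "GPU lacks fp8e4nv support")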
2025-05-07T20:31:45.4735077Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:45.4738589Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.4740499Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.4740632Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:31:45.4740744Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:45.4744299Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.4746146Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.4746284Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:31:45.4746393Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:45.4749709Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.4751593Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.4751727Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:31:45.4751836Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:45.4755267Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.4757126Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.4757257Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:31:45.4757369Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:45.4760728Z >       x_sign = torch.sign(x)
2025-05-07T20:31:45.4762506Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.4762633Z moe/activation_test.py:94: OutOfMemoryError
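Note: the OutOfMemoryError examples above are allocation failures rather than kernel bugs. A single 16384 x 14336 bf16 input is already 448 MiB, and after many failed examples the process is holding roughly 21.6 GiB of the A10G's 22.07 GiB, so even 56 MiB requests fail. The error text itself suggests expandable segments; a minimal sketch of that plus an explicit per-example cleanup, assuming the variable is set before the process first touches CUDA (the release_cuda_memory helper is hypothetical, not part of the test):

    import gc
    import os

    # Per the error text: must be in the environment before the first CUDA allocation.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Hypothetical cleanup between Hypothesis examples: collect dead tensors,
        # then return cached blocks to the driver so the next example starts fresh.
        gc.collect()
        torch.cuda.empty_cache()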
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4769886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4770117Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4770458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4770565Z kernel = self.compile( 2025-05-07T20:31:45.4770949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4771122Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4771258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4771263Z 2025-05-07T20:31:45.4771466Z self = 2025-05-07T20:31:45.4772252Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4772760Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f687191dbc0>} 2025-05-07T20:31:45.4773601Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4773799Z context = 2025-05-07T20:31:45.4773804Z 2025-05-07T20:31:45.4773967Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4774238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4774344Z module_map=module_map) 2025-05-07T20:31:45.4774503Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4774608Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4774767Z E ^ 2025-05-07T20:31:45.4775134Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4775138Z 2025-05-07T20:31:45.4775588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4775593Z 2025-05-07T20:31:45.4775711Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4775951Z self=, 2025-05-07T20:31:45.4776028Z T=128, 2025-05-07T20:31:45.4776104Z D=5120, 2025-05-07T20:31:45.4776192Z scale_ub=None, 2025-05-07T20:31:45.4776277Z contiguous=True, 2025-05-07T20:31:45.4776369Z compiled=False, 2025-05-07T20:31:45.4776441Z ) 2025-05-07T20:31:45.4776659Z self = 2025-05-07T20:31:45.4776835Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.4776845Z 2025-05-07T20:31:45.4776921Z @given( 2025-05-07T20:31:45.4777038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4777144Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4777264Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4777383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4777503Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4777575Z ) 2025-05-07T20:31:45.4777827Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4777920Z def test_silu_mul_quant( 2025-05-07T20:31:45.4777997Z self, 2025-05-07T20:31:45.4778081Z T: int, 2025-05-07T20:31:45.4778157Z D: int, 2025-05-07T20:31:45.4778255Z scale_ub: Optional[float], 2025-05-07T20:31:45.4778350Z contiguous: bool, 2025-05-07T20:31:45.4778434Z compiled: bool, 2025-05-07T20:31:45.4778516Z ) -> None: 2025-05-07T20:31:45.4778615Z torch.manual_seed(2025) 2025-05-07T20:31:45.4778688Z 2025-05-07T20:31:45.4778857Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4778936Z 2025-05-07T20:31:45.4779032Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4779163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4779252Z x = x_sign * x_clamp 2025-05-07T20:31:45.4779333Z x0 = x[:, :D] 2025-05-07T20:31:45.4779422Z x1 = x[:, D:] 2025-05-07T20:31:45.4779496Z 2025-05-07T20:31:45.4779580Z if contiguous: 2025-05-07T20:31:45.4779679Z x0 = x0.contiguous() 2025-05-07T20:31:45.4779767Z x1 = x1.contiguous() 2025-05-07T20:31:45.4779840Z 2025-05-07T20:31:45.4779937Z if scale_ub is not None: 2025-05-07T20:31:45.4780042Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4780177Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4780264Z ) 2025-05-07T20:31:45.4780340Z else: 2025-05-07T20:31:45.4780434Z scale_ub_tensor = None 2025-05-07T20:31:45.4780513Z 2025-05-07T20:31:45.4780728Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4780827Z op = silu_mul_quant 2025-05-07T20:31:45.4780912Z if compiled: 2025-05-07T20:31:45.4781012Z op = torch.compile(op) 2025-05-07T20:31:45.4781124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4781195Z 2025-05-07T20:31:45.4781286Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4781290Z 2025-05-07T20:31:45.4781394Z moe/activation_test.py:117: 2025-05-07T20:31:45.4781523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4781624Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4781731Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4782230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4782414Z 
_fbgemm_silu_mul_quant[grid](
[... Triton JIT/compile traceback identical to the occurrence above: jit.py:330 -> jit.py:623 -> compiler.py:273 -> make_ir ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False
[... test body and failure identical to the first listing above: fn() at moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid] ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
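A note on the recurring CompilationError: it is an architecture mismatch, not a bug in the test body. Triton's fp8e4nv type (PyTorch's torch.float8_e4m3fn) requires an NVIDIA GPU with compute capability 8.9 or newer, while the GPU in this job (22.07 GiB reported, consistent with a pre-sm_89 part) only exposes fp8e4b15 and fp8e5. A minimal sketch of a capability gate that would skip these examples on unsupported hardware; the helper name supports_fp8e4nv and the skip message are illustrative, not taken from the test suite:

    # Sketch: skip fp8e4nv-dependent tests on GPUs below sm_89 (Ada/Hopper).
    # `supports_fp8e4nv` is an illustrative helper, not part of FBGEMM.
    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) codegen needs compute
        # capability >= (8, 9); older parts only get fp8e4b15 / fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    class SiluMulQuantGuardedTest(unittest.TestCase):
        def test_capability_gate(self) -> None:
            self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))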
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False
[... test body identical to the first listing above ...]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
[... test body and failure identical to the first listing above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
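Each "Trying example:" block above is Hypothesis replaying the test with one tuple drawn from the @given strategies shown in the first listing. A stripped-down sketch of the same pattern; _MAX_SAMPLES is assumed to be a module-level constant, as in the real activation_test.py:

    # Sketch: how the @given/@settings pair drives the example stream.
    # _MAX_SAMPLES is an assumption; the real constant lives in the test module.
    from hypothesis import Verbosity, given, settings
    from hypothesis import strategies as st

    _MAX_SAMPLES = 25

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_parameter_grid(T: int, D: int, scale_ub) -> None:
        # Verbose mode prints each drawn tuple as a "Trying example:" line.
        assert (T, D) != (0, 0)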
[... the next eleven Hypothesis examples all raised torch.OutOfMemoryError before reaching the kernel; test body identical to the first listing above. Every message reports the same allocator state (GPU 0: 22.07 GiB total capacity, 30.44 MiB free, 22.03 GiB in use including non-PyTorch memory; 21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated) and ends with the same advice: "If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)". Only the parameters, the requested allocation, and the failing test line differ: ...]

    T      D     scale_ub  contiguous  compiled  tried to allocate  failing line
    2048   5120  None      True        False      40.00 MiB         moe/activation_test.py:94 (x_sign = torch.sign(x))
    16384  5120  None      True        False     320.00 MiB         moe/activation_test.py:92 (torch.randn)
    4096   5120  None      True        False      80.00 MiB         moe/activation_test.py:92
    2048   5120  None      False       False      40.00 MiB         moe/activation_test.py:92
    4096   7168  None      True        True      112.00 MiB         moe/activation_test.py:92
    2048   5120  1200.0    False       False      40.00 MiB         moe/activation_test.py:92
    4096   7168  1200.0    True        False     112.00 MiB         moe/activation_test.py:92
    16384  7168  None      False       True      448.00 MiB         moe/activation_test.py:92
    4096   7168  None      True        False     112.00 MiB         moe/activation_test.py:92
    16384  7168  None      True        False     448.00 MiB         moe/activation_test.py:92
    16384  7168  1200.0    True        False     448.00 MiB         moe/activation_test.py:92
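Note the pattern in the streak above: free memory stays pinned at 30.44 MiB while requests as small as 40 MiB fail, so memory from earlier examples is evidently accumulating in the process rather than being released between runs. A sketch of the two mitigations the error text itself points toward; release_cuda_memory is an illustrative helper name, and the allocator variable must be set before CUDA initializes:

    # Sketch: the allocator option suggested by the error text must be set
    # before the first CUDA allocation, e.g. in the shell that starts pytest:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest ...
    import gc
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # import after the env var so the allocator sees it

    def release_cuda_memory() -> None:
        # Illustrative teardown between Hypothesis examples: drop Python
        # references, then return cached blocks to the CUDA driver.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()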
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
[... test body and failure identical to the first listing above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False
[... test body identical to the first listing above ...]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
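For debugging outside the Hypothesis loop, the failing call can be reproduced directly. A minimal sketch; the import path is inferred from the traceback (fbgemm_gpu/experimental/gen_ai/moe/activation.py) and should be treated as an assumption:

    # Sketch: standalone repro of the failing silu_mul_quant call.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # On a pre-sm_89 GPU this raises the same fp8e4nv CompilationError seen
    # throughout this log; on sm_89+ it returns the fp8 tensor and its scale.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)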
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
[... test body identical to the first listing above; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching silu_mul_quant and the same Triton compile path ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

[... three final examples, test body identical to the first listing above; each raised torch.OutOfMemoryError trying to allocate 20.00 MiB with only 8.44 MiB of 22.07 GiB free (about 21.77 GiB allocated by PyTorch) ...]

    T    D     scale_ub  contiguous  compiled  failing line
    128  7168  1200.0    True        False     moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0))
    128  5120  1200.0    True        True      moe/activation_test.py:94 (x_sign = torch.sign(x))
    128  7168  None      True        True      moe/activation_test.py:92 (torch.randn)

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 (repeated 3 times)
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated.
See " 2025-05-07T20:31:45.4938873Z 2025-05-07T20:31:45.4939056Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:45.4940322Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:45.4940515Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:45.4940520Z 2025-05-07T20:31:45.4940726Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:45.4940888Z ================== 1 failed, 1 passed, 13 warnings in 21.90s =================== 2025-05-07T20:31:47.2200803Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:47.2824137Z 2025-05-07T20:31:47.2825516Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:47.2825890Z 2025-05-07T20:31:47.2825896Z 2025-05-07T20:31:47.2845929Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:49.4337482Z ============================= test session starts ============================== 2025-05-07T20:31:49.4338138Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:49.4338669Z cachedir: .pytest_cache 2025-05-07T20:31:49.4339243Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:49.4340321Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:49.4340729Z plugins: hypothesis-6.131.14 2025-05-07T20:31:51.0409445Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:51.1936522Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:31:51.1937322Z run-last-failure: rerun previous 1 failure 2025-05-07T20:31:51.1937538Z 2025-05-07T20:31:53.3077015Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:53.3078812Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:53.3080972Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:53.3083394Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:53.3085020Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3086982Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:53.3089296Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.3090905Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3092906Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:53.3095101Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.3096814Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3098911Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:53.3101420Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:53.3103512Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:53.3105627Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:53.3107071Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3108911Z 
W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:53.3111080Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:53.3112487Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:53.3114613Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:53.3116883Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:53.3118906Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:53.3120726Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:53.3122703Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:53.3124824Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:53.3126537Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.3127976Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.3129507Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:53.3131143Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
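[Editor's note] This ValueError, repeated for the rest of the run, is the real failure: fp8e4nv is Triton's name for float8_e4m3fn, whose conversions require compute capability sm_89 (Ada) or newer, while the linux.g5.4xlarge runner carries an A10G at sm_86 — hence only fp8e4b15 and fp8e5 are on offer. A hedged probe using only public PyTorch API:

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv (a.k.a. float8_e4m3fn) needs compute capability >= (8, 9):
        # Ada sm_89 or Hopper sm_90. The A10G on this runner is sm_86.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)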
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.3248748Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:53.3250537Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:53.3252706Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:53.3255110Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:53.3257053Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3259299Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:53.3261600Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.3263041Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3265151Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:53.3267469Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.3269361Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3271504Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:53.3273592Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:53.3275709Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:53.3277782Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:53.3279220Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3280907Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:53.3282643Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:53.3284042Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:53.3286119Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:53.3288426Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:53.3290442Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:53.3292254Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:53.3294611Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:53.3297041Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:53.3298931Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.3300568Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.3301838Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:53.3303836Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
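[Editor's note] The test listing that follows spells out the reference path in full; extracted on its own, the math ref_fn checks against is just an fp32 SiLU-mul, as in this sketch taken directly from that source:

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # fp32 reference from ref_fn in the test below: SiLU(x0) * x1.
        x0 = x0.to(torch.float32)
        x1 = x1.to(torch.float32)
        return x0 * torch.sigmoid(x0) * x1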
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.8405243Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.8405943Z self=, 2025-05-07T20:31:53.8406360Z T=1, 2025-05-07T20:31:53.8406575Z D=5120, 2025-05-07T20:31:53.8406776Z scale_ub=None, 2025-05-07T20:31:53.8407011Z contiguous=True, 2025-05-07T20:31:53.8407247Z compiled=True, 2025-05-07T20:31:53.8407460Z ) 2025-05-07T20:31:53.8407794Z self = 2025-05-07T20:31:53.8408286Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:53.8408569Z 2025-05-07T20:31:53.8408654Z @given( 2025-05-07T20:31:53.8408900Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.8409229Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.8409537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.8409885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.8410224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.8410516Z ) 2025-05-07T20:31:53.8410869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.8411317Z def test_silu_mul_quant( 2025-05-07T20:31:53.8411573Z self, 2025-05-07T20:31:53.8411773Z T: int, 2025-05-07T20:31:53.8411985Z D: int, 2025-05-07T20:31:53.8412214Z scale_ub: Optional[float], 2025-05-07T20:31:53.8412488Z contiguous: bool, 2025-05-07T20:31:53.8412742Z compiled: bool, 2025-05-07T20:31:53.8412983Z ) -> None: 2025-05-07T20:31:53.8413206Z torch.manual_seed(2025) 2025-05-07T20:31:53.8413455Z 2025-05-07T20:31:53.8413737Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.8414080Z 2025-05-07T20:31:53.8414284Z x_sign = torch.sign(x) 2025-05-07T20:31:53.8414586Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.8414897Z x = x_sign * x_clamp 2025-05-07T20:31:53.8415157Z x0 = x[:, :D] 2025-05-07T20:31:53.8415373Z x1 = x[:, D:] 2025-05-07T20:31:53.8415587Z 2025-05-07T20:31:53.8415780Z if contiguous: 2025-05-07T20:31:53.8416011Z x0 = x0.contiguous() 2025-05-07T20:31:53.8416274Z x1 = x1.contiguous() 2025-05-07T20:31:53.8416522Z 2025-05-07T20:31:53.8416715Z if scale_ub is not None: 2025-05-07T20:31:53.8417000Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.8417340Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.8417646Z ) 2025-05-07T20:31:53.8417857Z else: 2025-05-07T20:31:53.8418078Z scale_ub_tensor = None 2025-05-07T20:31:53.8418331Z 2025-05-07T20:31:53.8418568Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.8418888Z op = silu_mul_quant 2025-05-07T20:31:53.8419441Z if compiled: 2025-05-07T20:31:53.8419699Z op = torch.compile(op) 2025-05-07T20:31:53.8420007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.8420284Z 2025-05-07T20:31:53.8420476Z y_fp8, y_scale = fn() 2025-05-07T20:31:53.8420762Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:53.8421054Z 2025-05-07T20:31:53.8421289Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.8421625Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:53.8421924Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:53.8422236Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:53.8422804Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.8423119Z 2025-05-07T20:31:53.8423319Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:53.8423521Z 2025-05-07T20:31:53.8423624Z moe/activation_test.py:126: 2025-05-07T20:31:53.8423928Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.8424270Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:53.8424590Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.8425380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:53.8426134Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:53.8426673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.8427360Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.8428057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:53.8429226Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:53.8429981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:53.8430727Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:53.8431450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:53.8432091Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:53.8432685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:53.8433208Z fn() 2025-05-07T20:31:53.8433722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:53.8434316Z self.fn.run( 2025-05-07T20:31:53.8434780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.8435318Z kernel = self.compile( 2025-05-07T20:31:53.8435864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.8436511Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.8436914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.8437142Z 2025-05-07T20:31:53.8437358Z self = 2025-05-07T20:31:53.8438443Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.8439831Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4986e7ae80>} 2025-05-07T20:31:53.8441317Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.8442344Z context = 2025-05-07T20:31:53.8442631Z 2025-05-07T20:31:53.8442805Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.8443316Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.8443785Z module_map=module_map) 2025-05-07T20:31:53.8444157Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.8444633Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:53.8444899Z E ^ 2025-05-07T20:31:53.8445363Z E ValueError("type fp8e4nv not supported in this architecture. 
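[Editor's note] Note that ref_fn fails the same way: triton_quantize_fp8_row JIT-compiles its own kernel (_kernel_quantize_fp8_row) and trips over the same dtype. For intuition only, a pure-PyTorch stand-in — an assumption about the rowwise scheme, not FBGEMM's kernel — that honors the dequantization contract the test uses (y ≈ y_fp8.float() * y_scale[:, None]):

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # apply the upper bound
        scale = row_max.clamp(min=1e-12) / fp8_max       # one scale per row
        y_scaled = (y / scale[:, None]).clamp(-fp8_max, fp8_max)
        return y_scaled.to(torch.float8_e4m3fn), scale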
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.8445823Z 2025-05-07T20:31:53.8446247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.8446754Z 2025-05-07T20:31:53.8446868Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.8447278Z self=, 2025-05-07T20:31:53.8447702Z T=2048, 2025-05-07T20:31:53.8447901Z D=5120, 2025-05-07T20:31:53.8448104Z scale_ub=1200.0, 2025-05-07T20:31:53.8448328Z contiguous=True, 2025-05-07T20:31:53.8448557Z compiled=False, 2025-05-07T20:31:53.8448780Z ) 2025-05-07T20:31:54.3751617Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.3752750Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:54.3754100Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.3755532Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.3756504Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.3757805Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.3759184Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.3760158Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.3761381Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.3762748Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.3764153Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.3765431Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.3766669Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:54.3767890Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.3769252Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:54.3770089Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.3771111Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:54.3772121Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:54.3772913Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:54.3774121Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.3775404Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.3776525Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:54.3777562Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:54.3778782Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.3780145Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.3781208Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.3782114Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.3782857Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:54.3783874Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.4846759Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.4847875Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:54.4850631Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.4853468Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.4855419Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.4858006Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.4859909Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.4860890Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.4862112Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.4863486Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.4864555Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.4865842Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.4867093Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:54.4868310Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.4869637Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:54.4870472Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.4871502Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:54.4872524Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:54.4873313Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:54.4874523Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.4875800Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.4877004Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:54.4878039Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:54.4879217Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.4880578Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.4881718Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.4882641Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.4883379Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:54.4884402Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
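[Editor's note] The W0507 records wrapping each failure come from torch.compile: Dynamo generates TTIR for the user-defined Triton kernel to work out which inputs it mutates, hits the same ValueError, and conservatively assumes every input is mutated — the warning is a symptom, not a separate bug. One blunt pattern a harness might use to keep eager coverage when the compiled path cannot build (a sketch, not what activation_test.py does — and here even eager fails, since the kernel itself targets fp8e4nv):

    import torch

    def run_with_eager_fallback(op, *args):
        try:
            return torch.compile(op)(*args)   # compile errors surface at call time
        except Exception:
            return op(*args)                  # eager path as a last resort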
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.9320179Z self = 2025-05-07T20:31:54.9320686Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:54.9320966Z 2025-05-07T20:31:54.9321061Z @given( 2025-05-07T20:31:54.9321330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.9321756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.9322145Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.9322481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.9322822Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.9323108Z ) 2025-05-07T20:31:54.9323466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.9323914Z def test_silu_mul_quant( 2025-05-07T20:31:54.9324154Z self, 2025-05-07T20:31:54.9324358Z T: int, 2025-05-07T20:31:54.9324566Z D: int, 2025-05-07T20:31:54.9324785Z scale_ub: Optional[float], 2025-05-07T20:31:54.9325067Z contiguous: bool, 2025-05-07T20:31:54.9325316Z compiled: bool, 2025-05-07T20:31:54.9325544Z ) -> None: 2025-05-07T20:31:54.9325766Z torch.manual_seed(2025) 2025-05-07T20:31:54.9326015Z 2025-05-07T20:31:54.9326294Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.9326644Z 2025-05-07T20:31:54.9326843Z x_sign = torch.sign(x) 2025-05-07T20:31:54.9327133Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.9327448Z x = x_sign * x_clamp 2025-05-07T20:31:54.9327697Z x0 = x[:, :D] 2025-05-07T20:31:54.9327920Z x1 = x[:, D:] 2025-05-07T20:31:54.9328403Z 2025-05-07T20:31:54.9328600Z if contiguous: 2025-05-07T20:31:54.9328839Z x0 = x0.contiguous() 2025-05-07T20:31:54.9329090Z x1 = x1.contiguous() 2025-05-07T20:31:54.9329333Z 2025-05-07T20:31:54.9329528Z if scale_ub is not None: 2025-05-07T20:31:54.9329795Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.9330131Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.9330444Z ) 2025-05-07T20:31:54.9330638Z else: 2025-05-07T20:31:54.9330855Z scale_ub_tensor = None 2025-05-07T20:31:54.9331111Z 2025-05-07T20:31:54.9331343Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.9331661Z op = silu_mul_quant 2025-05-07T20:31:54.9331918Z if compiled: 2025-05-07T20:31:54.9332327Z op = torch.compile(op) 2025-05-07T20:31:54.9332632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.9332910Z 2025-05-07T20:31:54.9333106Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.9333272Z 2025-05-07T20:31:54.9333373Z moe/activation_test.py:117: 2025-05-07T20:31:54.9333669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.9334006Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.9334285Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.9334983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:54.9335677Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.9336332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.9337005Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.9337670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.9338205Z kernel = self.compile( 2025-05-07T20:31:54.9338744Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.9339405Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.9339807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.9340040Z 2025-05-07T20:31:54.9340253Z self = 2025-05-07T20:31:54.9341324Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.9342698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f49872f9da0>} 2025-05-07T20:31:54.9344029Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.9345048Z context = 2025-05-07T20:31:54.9345335Z 2025-05-07T20:31:54.9345510Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.9346020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.9346490Z module_map=module_map) 2025-05-07T20:31:54.9346864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.9347212Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.9347474Z E ^ 2025-05-07T20:31:54.9347946Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.9348393Z 2025-05-07T20:31:54.9348833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.9349446Z 2025-05-07T20:31:54.9349551Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.9349968Z self=, 2025-05-07T20:31:54.9350375Z T=2048, 2025-05-07T20:31:54.9350563Z D=5120, 2025-05-07T20:31:54.9350760Z scale_ub=1200.0, 2025-05-07T20:31:54.9350993Z contiguous=True, 2025-05-07T20:31:54.9351215Z compiled=True, 2025-05-07T20:31:54.9351432Z ) 2025-05-07T20:31:54.9351756Z self = 2025-05-07T20:31:54.9352246Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:54.9352513Z 2025-05-07T20:31:54.9352594Z @given( 2025-05-07T20:31:54.9352920Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.9353238Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.9353543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.9353874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.9354201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.9354481Z ) 2025-05-07T20:31:54.9354832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.9355271Z def test_silu_mul_quant( 2025-05-07T20:31:54.9355510Z self, 2025-05-07T20:31:54.9355710Z T: int, 2025-05-07T20:31:54.9355917Z D: int, 2025-05-07T20:31:54.9356216Z scale_ub: Optional[float], 2025-05-07T20:31:54.9356486Z contiguous: bool, 2025-05-07T20:31:54.9356729Z compiled: bool, 2025-05-07T20:31:54.9356958Z ) -> None: 2025-05-07T20:31:54.9357172Z torch.manual_seed(2025) 2025-05-07T20:31:54.9357423Z 2025-05-07T20:31:54.9357704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.9358042Z 2025-05-07T20:31:54.9358244Z x_sign = torch.sign(x) 2025-05-07T20:31:54.9358540Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.9358845Z x = x_sign * x_clamp 2025-05-07T20:31:54.9359088Z x0 = x[:, :D] 
2025-05-07T20:31:54.9359316Z x1 = x[:, D:] 2025-05-07T20:31:54.9359522Z 2025-05-07T20:31:54.9359713Z if contiguous: 2025-05-07T20:31:54.9359951Z x0 = x0.contiguous() 2025-05-07T20:31:54.9360207Z x1 = x1.contiguous() 2025-05-07T20:31:54.9360449Z 2025-05-07T20:31:54.9360644Z if scale_ub is not None: 2025-05-07T20:31:54.9367031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.9367442Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.9367749Z ) 2025-05-07T20:31:54.9367956Z else: 2025-05-07T20:31:54.9368183Z scale_ub_tensor = None 2025-05-07T20:31:54.9368440Z 2025-05-07T20:31:54.9368720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.9369071Z op = silu_mul_quant 2025-05-07T20:31:54.9369323Z if compiled: 2025-05-07T20:31:54.9369584Z op = torch.compile(op) 2025-05-07T20:31:54.9369886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.9370159Z 2025-05-07T20:31:54.9370368Z y_fp8, y_scale = fn() 2025-05-07T20:31:54.9370661Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:54.9370961Z 2025-05-07T20:31:54.9371202Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.9371550Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:54.9371852Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:54.9372168Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:54.9372546Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:54.9372867Z 2025-05-07T20:31:54.9373067Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:54.9373272Z 2025-05-07T20:31:54.9373376Z moe/activation_test.py:126: 2025-05-07T20:31:54.9373682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.9374029Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:54.9374354Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:54.9375154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:54.9375919Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:54.9376471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.9377161Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.9377990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:54.9378745Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:54.9379534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:54.9380291Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:54.9381029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:54.9381683Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:54.9382363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:54.9382892Z fn() 2025-05-07T20:31:54.9383416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:54.9384000Z self.fn.run( 2025-05-07T20:31:54.9384471Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.9385012Z kernel = self.compile( 2025-05-07T20:31:54.9385564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.9386218Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.9386623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.9386855Z 2025-05-07T20:31:54.9387074Z self = 2025-05-07T20:31:54.9388183Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.9389668Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4985c122a0>} 2025-05-07T20:31:54.9391025Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.9392066Z context = 2025-05-07T20:31:54.9392358Z 2025-05-07T20:31:54.9392539Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.9393063Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.9393543Z module_map=module_map) 2025-05-07T20:31:54.9393915Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.9394281Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:54.9394548Z E ^ 2025-05-07T20:31:54.9395023Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.9395478Z 2025-05-07T20:31:54.9395910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.9396427Z 2025-05-07T20:31:54.9396533Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.9396956Z self=, 2025-05-07T20:31:54.9397368Z T=16384, 2025-05-07T20:31:54.9397570Z D=7168, 2025-05-07T20:31:54.9397771Z scale_ub=1200.0, 2025-05-07T20:31:54.9398011Z contiguous=False, 2025-05-07T20:31:54.9398250Z compiled=False, 2025-05-07T20:31:54.9398459Z ) 2025-05-07T20:31:55.2439846Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.2440938Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:31:55.2442285Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.2443706Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.2444794Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.2446108Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 
2025-05-07T20:31:55.2447490Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.2448481Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.2449705Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.2451091Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.2452163Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.2453444Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.2454697Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:31:55.2455923Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.2457138Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:31:55.2457974Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.2459005Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:55.2460027Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:31:55.2460819Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:55.2462122Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.2463409Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.2464535Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:55.2465579Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:31:55.2466755Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.2468197Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.2469328Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.2470250Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.2470995Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:31:55.2472012Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.3190646Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.3191733Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:31:55.3193074Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.3194493Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.3195475Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3196788Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:55.3198165Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.3199192Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3200432Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.3201812Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.3203037Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3204316Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.3205572Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:31:55.3206795Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.3208124Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:31:55.3208962Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3209992Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:55.3211011Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:31:55.3211802Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:55.3213016Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.3214309Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.3215425Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:55.3216467Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:31:55.3217649Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.3219061Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.3220127Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.3221048Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.3221795Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:31:55.3222816Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.9918345Z self = 2025-05-07T20:31:55.9919281Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:55.9919621Z 2025-05-07T20:31:55.9919702Z @given( 2025-05-07T20:31:55.9919946Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.9920425Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.9920739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.9921076Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.9921399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.9921687Z ) 2025-05-07T20:31:55.9922042Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.9922488Z def test_silu_mul_quant( 2025-05-07T20:31:55.9922731Z self, 2025-05-07T20:31:55.9922932Z T: int, 2025-05-07T20:31:55.9923138Z D: int, 2025-05-07T20:31:55.9923355Z scale_ub: Optional[float], 2025-05-07T20:31:55.9923633Z contiguous: bool, 2025-05-07T20:31:55.9924049Z compiled: bool, 2025-05-07T20:31:55.9924273Z ) -> None: 2025-05-07T20:31:55.9924500Z torch.manual_seed(2025) 2025-05-07T20:31:55.9924752Z 2025-05-07T20:31:55.9925037Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.9925387Z 2025-05-07T20:31:55.9925587Z x_sign = torch.sign(x) 2025-05-07T20:31:55.9925877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.9926201Z x = x_sign * x_clamp 2025-05-07T20:31:55.9926452Z x0 = x[:, :D] 2025-05-07T20:31:55.9926669Z x1 = x[:, D:] 2025-05-07T20:31:55.9926883Z 2025-05-07T20:31:55.9927073Z if contiguous: 2025-05-07T20:31:55.9927301Z x0 = x0.contiguous() 2025-05-07T20:31:55.9927561Z x1 = x1.contiguous() 2025-05-07T20:31:55.9927807Z 2025-05-07T20:31:55.9928003Z if scale_ub is not None: 2025-05-07T20:31:55.9928454Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.9928799Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.9929110Z ) 2025-05-07T20:31:55.9929302Z else: 2025-05-07T20:31:55.9929515Z scale_ub_tensor = None 2025-05-07T20:31:55.9929770Z 2025-05-07T20:31:55.9930005Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.9930323Z op = silu_mul_quant 2025-05-07T20:31:55.9930577Z if compiled: 2025-05-07T20:31:55.9930825Z op = torch.compile(op) 2025-05-07T20:31:55.9931125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.9931403Z 2025-05-07T20:31:55.9931593Z > y_fp8, y_scale = fn() 2025-05-07T20:31:55.9931762Z 2025-05-07T20:31:55.9931864Z moe/activation_test.py:117: 2025-05-07T20:31:55.9932163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.9932496Z moe/activation_test.py:115: in fn 2025-05-07T20:31:55.9932774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.9933466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:55.9934161Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:55.9934708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.9935390Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.9936049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.9936584Z kernel = self.compile( 2025-05-07T20:31:55.9937125Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.9937784Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.9938179Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.9938422Z 2025-05-07T20:31:55.9938628Z self = 2025-05-07T20:31:55.9939877Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.9941242Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f49857a2700>} 2025-05-07T20:31:55.9942574Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.9943594Z context = 2025-05-07T20:31:55.9943886Z 2025-05-07T20:31:55.9944162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.9944681Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.9945144Z module_map=module_map) 2025-05-07T20:31:55.9945520Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.9945881Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.9946135Z E ^ 2025-05-07T20:31:55.9946600Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.9947056Z 2025-05-07T20:31:55.9947470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.9947978Z 2025-05-07T20:31:55.9948091Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.9948498Z self=, 2025-05-07T20:31:55.9948908Z T=1, 2025-05-07T20:31:55.9949201Z D=7168, 2025-05-07T20:31:55.9949424Z scale_ub=None, 2025-05-07T20:31:55.9949642Z contiguous=True, 2025-05-07T20:31:55.9949874Z compiled=True, 2025-05-07T20:31:55.9950082Z ) 2025-05-07T20:31:55.9950406Z self = 2025-05-07T20:31:55.9950887Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:55.9951144Z 2025-05-07T20:31:55.9951231Z @given( 2025-05-07T20:31:55.9951464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.9951786Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.9952094Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.9952422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.9952753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.9953042Z ) 2025-05-07T20:31:55.9953393Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.9953830Z def test_silu_mul_quant( 2025-05-07T20:31:55.9954073Z self, 2025-05-07T20:31:55.9954269Z T: int, 2025-05-07T20:31:55.9954464Z D: int, 2025-05-07T20:31:55.9954682Z scale_ub: Optional[float], 2025-05-07T20:31:55.9954959Z contiguous: bool, 2025-05-07T20:31:55.9955193Z compiled: bool, 2025-05-07T20:31:55.9955416Z ) -> None: 2025-05-07T20:31:55.9955634Z torch.manual_seed(2025) 2025-05-07T20:31:55.9955882Z 2025-05-07T20:31:55.9956161Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.9956500Z 2025-05-07T20:31:55.9956691Z x_sign = torch.sign(x) 2025-05-07T20:31:55.9956985Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.9957290Z x = x_sign * x_clamp 2025-05-07T20:31:55.9957527Z x0 = x[:, :D] 2025-05-07T20:31:55.9957755Z x1 = 
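Every failure in this run has the same root cause: Triton's fp8e4nv is the FP8 E4M3 format, whose NVIDIA codegen requires compute capability 8.9+ (Ada/Hopper), while the linux.g5.4xlarge runner carries an A10G at SM 8.6; on that architecture Triton offers only fp8e4b15 and fp8e5, exactly as the ValueError reports. A guard of roughly the following shape would skip the test on such runners; the helper name and placement are illustrative, not FBGEMM's actual mechanism:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        """True if this GPU can compile Triton kernels that use fp8e4nv (E4M3)."""
        if not torch.cuda.is_available():
            return False
        # E4M3 codegen needs SM 8.9+; the A10G on g5 instances is SM 8.6.
        return torch.cuda.get_device_capability() >= (8, 9)


    class ActivationTests(unittest.TestCase):
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
        def test_silu_mul_quant(self) -> None:
            ...  # hypothesis-parametrized body as shown in the log above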
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    [test body identical to the first failing example above]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
[remaining Triton compile chain identical to the first example]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
2025-05-07T20:31:56.3578284Z W0507 20:31:56.355000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[traceback identical: CompilationError at 1:0 in _fbgemm_silu_mul_quant, fp8e4nv unsupported]
2025-05-07T20:31:56.6217456Z W0507 20:31:56.619000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[traceback identical: CompilationError at 1:0 in _fbgemm_silu_mul_quant, fp8e4nv unsupported]
2025-05-07T20:31:57.2962523Z self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False

    [test body identical to the first failing example above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[remaining Triton compile chain identical to the first example]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
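For debugging outside pytest/hypothesis, the failing eager path reproduces in a few lines. The module path comes from the traceback above (fbgemm_gpu/experimental/gen_ai/moe/activation.py) and the call signature from the test body; treat both as assumptions about this particular genai build rather than a documented API:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # On SM < 8.9 this raises triton.compiler.errors.CompilationError:
    #   ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)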
Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False

    [test body identical to the first failing example above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[remaining Triton compile chain identical to the first example]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
2025-05-07T20:31:57.3463984Z self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [test body identical to the first failing example above; y_fp8, y_scale = fn() returns, so the failure is raised from the reference path]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[autotuner and Triton compile chain identical to the T=1 reference-path example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
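The reference path fails the same way because triton_quantize_fp8_row is itself a Triton kernel targeting the E4M3 output type. What it computes is ordinary row-wise quantization, as the test's dequantization step y_fp8.to(torch.float32) * y_scale[:, None] implies. A pure-PyTorch sketch of that computation, with the exact clamping and epsilon handling of _kernel_quantize_fp8_row assumed rather than known:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # One scale per row, chosen so the row's max |value| lands at the FP8 limit.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # scale_ub: 1-element tensor
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize with y_fp8.to(torch.float32) * scale[:, None].
        return y_fp8, scale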
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
2025-05-07T20:31:57.6536934Z self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

    [test body identical to the first failing example above]

>       y_fp8, y_scale = fn()

[eager _fbgemm_silu_mul_quant compile chain identical to the first example]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    [test body identical to the first failing example above]

>       y_fp8, y_scale = fn()

[eager _fbgemm_silu_mul_quant compile chain identical to the first example]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
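The two FP8 formats the error does list for this architecture are fp8e4b15 and fp8e5; fp8e5 is E5M2, which PyTorch exposes as torch.float8_e5m2. A quick round-trip check of that dtype (a plain cast with no Triton codegen involved, so it should work on the A10G; that expectation is an assumption, not something this log verifies):

    import torch

    x = torch.randn(4, 8, device="cuda", dtype=torch.bfloat16)
    x_e5m2 = x.to(torch.float8_e5m2)  # E5M2: wider range, less precision than E4M3
    err = (x.to(torch.float32) - x_e5m2.to(torch.float32)).abs().max()
    print(f"max round-trip error: {err.item():.4f}")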
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:57.9859639Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:31:57.9860520Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:57.9861546Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:57.9862556Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:31:57.9863349Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^ 2025-05-07T20:31:57.9864557Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:57.9865831Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:57.9866943Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:57.9868110Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:31:57.9869367Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:57.9870717Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:57.9871777Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.9872685Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.9873498Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:31:57.9874521Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:58.0691213Z W0507 20:31:58.066000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:58.0720633Z W0507 20:31:58.066000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:58.0721540Z W0507 20:31:58.066000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:58.0722284Z W0507 20:31:58.066000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^
2025-05-07T20:31:58.0723302Z W0507 20:31:58.066000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.3675451Z self = 2025-05-07T20:31:58.3675946Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:58.3676292Z 2025-05-07T20:31:58.3676408Z @given( 2025-05-07T20:31:58.3676737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.3677053Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.3677363Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.3677697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.3678027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.3678324Z ) 2025-05-07T20:31:58.3678684Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.3679128Z def test_silu_mul_quant( 2025-05-07T20:31:58.3679366Z self, 2025-05-07T20:31:58.3679564Z T: int, 2025-05-07T20:31:58.3679936Z D: int, 2025-05-07T20:31:58.3680156Z scale_ub: Optional[float], 2025-05-07T20:31:58.3680427Z contiguous: bool, 2025-05-07T20:31:58.3680672Z compiled: bool, 2025-05-07T20:31:58.3680896Z ) -> None: 2025-05-07T20:31:58.3681115Z torch.manual_seed(2025) 2025-05-07T20:31:58.3681355Z 2025-05-07T20:31:58.3681620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.3681966Z 2025-05-07T20:31:58.3682165Z x_sign = torch.sign(x) 2025-05-07T20:31:58.3682459Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.3682765Z x = x_sign * x_clamp 2025-05-07T20:31:58.3683014Z x0 = x[:, :D] 2025-05-07T20:31:58.3683382Z x1 = x[:, D:] 2025-05-07T20:31:58.3683582Z 2025-05-07T20:31:58.3683770Z if contiguous: 2025-05-07T20:31:58.3684006Z x0 = x0.contiguous() 2025-05-07T20:31:58.3684263Z x1 = x1.contiguous() 2025-05-07T20:31:58.3684514Z 2025-05-07T20:31:58.3684709Z if scale_ub is not None: 2025-05-07T20:31:58.3684977Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.3685315Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.3685627Z ) 2025-05-07T20:31:58.3685815Z else: 2025-05-07T20:31:58.3686034Z scale_ub_tensor = None 2025-05-07T20:31:58.3686287Z 2025-05-07T20:31:58.3686518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.3686842Z op = silu_mul_quant 2025-05-07T20:31:58.3687096Z if compiled: 2025-05-07T20:31:58.3687348Z op = torch.compile(op) 2025-05-07T20:31:58.3687639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.3687917Z 2025-05-07T20:31:58.3688112Z y_fp8, y_scale = fn() 2025-05-07T20:31:58.3688394Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:58.3688682Z 2025-05-07T20:31:58.3688927Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.3689256Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:58.3689548Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:58.3689862Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:58.3690234Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.3690577Z 2025-05-07T20:31:58.3690779Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:58.3690978Z 2025-05-07T20:31:58.3691079Z moe/activation_test.py:126: 2025-05-07T20:31:58.3691379Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.3691714Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:58.3692040Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.3692825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:31:58.3693582Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:58.3694122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.3694795Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.3695475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:58.3696194Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.3696937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:58.3697684Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.3698408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:58.3699133Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:58.3699727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:58.3700268Z fn() 2025-05-07T20:31:58.3700796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:58.3701375Z self.fn.run( 2025-05-07T20:31:58.3701838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.3702368Z kernel = self.compile( 2025-05-07T20:31:58.3702906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.3703628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.3704023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.3704255Z 2025-05-07T20:31:58.3704467Z self = 2025-05-07T20:31:58.3705541Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.3706891Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4985498fe0>} 2025-05-07T20:31:58.3708224Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.3709308Z context = 2025-05-07T20:31:58.3709594Z 2025-05-07T20:31:58.3709766Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.3710282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.3710740Z module_map=module_map) 2025-05-07T20:31:58.3711105Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.3711459Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:58.3711719Z E ^ 2025-05-07T20:31:58.3712185Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:58.3713050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
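Every failure in this run bottoms out in the same Triton error: fp8e4nv is Triton's name for the e4m3 FP8 format with NVIDIA semantics, and its hardware conversions require CUDA compute capability 8.9 or newer (Ada/Hopper). On an SM 8.6 part such as the A10G in AWS g5 instances, the CUDA backend only exposes fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal guard sketch (a hypothetical helper, not part of the FBGEMM test suite) that would skip these tests on unsupported GPUs:

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Triton lowers tl.float8e4nv (e4m3) only on CUDA compute capability
    # >= (8, 9); older GPUs are limited to fp8e4b15 / fp8e5.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: decorate FP8 test cases so they skip instead of erroring.
requires_fp8e4nv = unittest.skipUnless(
    supports_fp8e4nv(), "fp8e4nv (e4m3) unsupported on this GPU"
)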
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:58.6855037Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:31:58.6855864Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:58.6856893Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:58.6857917Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:31:58.6858724Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:58.6859930Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:58.6861258Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:58.6862374Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:58.6863421Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:31:58.6864681Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:58.6866031Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:58.6867090Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.6868003Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.6868750Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:31:58.6869907Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.7677021Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:58.7679112Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:31:58.7680842Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:58.7682259Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:58.7683239Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:58.7684529Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:58.7685906Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.7686884Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:58.7688114Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:58.7689490Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.7690599Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:58.7691881Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:58.7693136Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:31:58.7694533Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:58.7695746Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:31:58.7696574Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:58.7697604Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:58.7698622Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:31:58.7699540Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:58.7700807Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:58.7702076Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:58.7703193Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:58.7704231Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:31:58.7705420Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:58.7706777Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:58.7707837Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.7708753Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.7709564Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:31:58.7710599Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.0660044Z self = 2025-05-07T20:31:59.0660975Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:59.0661373Z 2025-05-07T20:31:59.0661509Z @given( 2025-05-07T20:31:59.0661863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.0662316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.0662749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.0663104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.0663442Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.0663736Z ) 2025-05-07T20:31:59.0664092Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.0664548Z def test_silu_mul_quant( 2025-05-07T20:31:59.0664826Z self, 2025-05-07T20:31:59.0665033Z T: int, 2025-05-07T20:31:59.0665237Z D: int, 2025-05-07T20:31:59.0665454Z scale_ub: Optional[float], 2025-05-07T20:31:59.0665934Z contiguous: bool, 2025-05-07T20:31:59.0666189Z compiled: bool, 2025-05-07T20:31:59.0666414Z ) -> None: 2025-05-07T20:31:59.0666646Z torch.manual_seed(2025) 2025-05-07T20:31:59.0666892Z 2025-05-07T20:31:59.0667172Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.0667526Z 2025-05-07T20:31:59.0667725Z x_sign = torch.sign(x) 2025-05-07T20:31:59.0668015Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.0668331Z x = x_sign * x_clamp 2025-05-07T20:31:59.0668577Z x0 = x[:, :D] 2025-05-07T20:31:59.0668792Z x1 = x[:, D:] 2025-05-07T20:31:59.0669004Z 2025-05-07T20:31:59.0669406Z if contiguous: 2025-05-07T20:31:59.0669646Z x0 = x0.contiguous() 2025-05-07T20:31:59.0669911Z x1 = x1.contiguous() 2025-05-07T20:31:59.0670153Z 2025-05-07T20:31:59.0670380Z if scale_ub is not None: 2025-05-07T20:31:59.0670688Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.0671025Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.0671337Z ) 2025-05-07T20:31:59.0671533Z else: 2025-05-07T20:31:59.0671748Z scale_ub_tensor = None 2025-05-07T20:31:59.0672001Z 2025-05-07T20:31:59.0672234Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.0672555Z op = silu_mul_quant 2025-05-07T20:31:59.0672811Z if compiled: 2025-05-07T20:31:59.0673062Z op = torch.compile(op) 2025-05-07T20:31:59.0673364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.0673642Z 2025-05-07T20:31:59.0673834Z y_fp8, y_scale = fn() 2025-05-07T20:31:59.0674139Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:59.0674435Z 2025-05-07T20:31:59.0674671Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.0675008Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:59.0675310Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:59.0675630Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:59.0675988Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:59.0676304Z 2025-05-07T20:31:59.0676509Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:59.0676704Z 2025-05-07T20:31:59.0676807Z moe/activation_test.py:126: 2025-05-07T20:31:59.0677107Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.0677448Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:59.0677771Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:59.0678564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:31:59.0679332Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:59.0679886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.0680570Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.0681260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:59.0681983Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:59.0682741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:59.0683484Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:59.0684218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:59.0684867Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:59.0685685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:59.0686213Z fn() 2025-05-07T20:31:59.0686727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:59.0687314Z self.fn.run( 2025-05-07T20:31:59.0687779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.0688319Z kernel = self.compile( 2025-05-07T20:31:59.0688865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.0694817Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.0695347Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.0695593Z 2025-05-07T20:31:59.0695805Z self = 2025-05-07T20:31:59.0697199Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.0698811Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4984c9d8a0>} 2025-05-07T20:31:59.0700172Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.0701259Z context = 2025-05-07T20:31:59.0701554Z 2025-05-07T20:31:59.0701730Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.0702260Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.0702732Z module_map=module_map) 2025-05-07T20:31:59.0703102Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.0703459Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:59.0703730Z E ^ 2025-05-07T20:31:59.0704203Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:59.0705092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
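The reference path that keeps failing, triton_quantize_fp8_row, is row-wise FP8 quantization: each row of y is scaled by its max-abs value to fit the fp8 range, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so the returned per-row scale is the dequantization scale. A pure-PyTorch sketch of that scheme (an illustrative analogue under assumed clamping and epsilon details, not FBGEMM's kernel):

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max-abs, optionally bounded by scale_ub as in the test's
    # scale_ub_tensor argument.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    row_max = torch.clamp(row_max, min=1e-12)  # assumed guard against /0
    scale = row_max / FP8_MAX  # dequant: y ~= y_fp8.float() * scale[:, None]
    y_q = (y.to(torch.float32) / scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
    return y_q.to(torch.float8_e4m3fn), scale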
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.4025637Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:31:59.4026470Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:59.4027496Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:59.4028703Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:31:59.4029544Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:59.4030761Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.4032096Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.4033224Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:59.4034282Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:31:59.4035583Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.4036946Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.4038022Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.4038936Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.4039679Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:31:59.4040810Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.4859302Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.4860899Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:31:59.4863576Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.4866480Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.4868578Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:59.4871100Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.4872504Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.4873496Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:59.4874726Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.4876118Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.4877194Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:59.4878717Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.4879973Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:31:59.4881363Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.4882581Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:31:59.4883415Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:59.4884446Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:59.4885469Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:31:59.4886447Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:59.4887657Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.4888931Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.4890051Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:59.4891090Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:31:59.4892277Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.4893636Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.4894695Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.4895604Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.4896349Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:31:59.4897358Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.9997478Z self = 2025-05-07T20:31:59.9998156Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:59.9998433Z 2025-05-07T20:31:59.9998513Z @given( 2025-05-07T20:31:59.9998757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.9999070Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.9999380Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.9999718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.0000046Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.0000341Z ) 2025-05-07T20:32:00.0000692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.0001190Z def test_silu_mul_quant( 2025-05-07T20:32:00.0001435Z self, 2025-05-07T20:32:00.0001643Z T: int, 2025-05-07T20:32:00.0001853Z D: int, 2025-05-07T20:32:00.0002074Z scale_ub: Optional[float], 2025-05-07T20:32:00.0002350Z contiguous: bool, 2025-05-07T20:32:00.0002593Z compiled: bool, 2025-05-07T20:32:00.0003022Z ) -> None: 2025-05-07T20:32:00.0003248Z torch.manual_seed(2025) 2025-05-07T20:32:00.0003489Z 2025-05-07T20:32:00.0003759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.0004105Z 2025-05-07T20:32:00.0004306Z x_sign = torch.sign(x) 2025-05-07T20:32:00.0004591Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.0004903Z x = x_sign * x_clamp 2025-05-07T20:32:00.0005148Z x0 = x[:, :D] 2025-05-07T20:32:00.0005358Z x1 = x[:, D:] 2025-05-07T20:32:00.0005566Z 2025-05-07T20:32:00.0005753Z if contiguous: 2025-05-07T20:32:00.0005994Z x0 = x0.contiguous() 2025-05-07T20:32:00.0006248Z x1 = x1.contiguous() 2025-05-07T20:32:00.0006608Z 2025-05-07T20:32:00.0006804Z if scale_ub is not None: 2025-05-07T20:32:00.0007070Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.0007405Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.0007722Z ) 2025-05-07T20:32:00.0007911Z else: 2025-05-07T20:32:00.0008121Z scale_ub_tensor = None 2025-05-07T20:32:00.0008374Z 2025-05-07T20:32:00.0008600Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.0008914Z op = silu_mul_quant 2025-05-07T20:32:00.0009166Z if compiled: 2025-05-07T20:32:00.0009417Z op = torch.compile(op) 2025-05-07T20:32:00.0009710Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.0009985Z 2025-05-07T20:32:00.0010178Z y_fp8, y_scale = fn() 2025-05-07T20:32:00.0010487Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:00.0010800Z 2025-05-07T20:32:00.0011044Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.0011372Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:00.0011664Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:00.0011982Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:00.0012335Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.0012643Z 2025-05-07T20:32:00.0012847Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:00.0013040Z 2025-05-07T20:32:00.0013147Z moe/activation_test.py:126: 2025-05-07T20:32:00.0013437Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0013772Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:00.0014096Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.0014875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:00.0015628Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:00.0016172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.0016854Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.0017530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:00.0018249Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.0018993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:00.0019734Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.0020454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:00.0021090Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:00.0021689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:00.0022197Z fn() 2025-05-07T20:32:00.0022796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:00.0023380Z self.fn.run( 2025-05-07T20:32:00.0023844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.0024372Z kernel = self.compile( 2025-05-07T20:32:00.0024914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.0025561Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.0025952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0026260Z 2025-05-07T20:32:00.0026466Z self = 2025-05-07T20:32:00.0027546Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.0029135Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f49843453a0>} 2025-05-07T20:32:00.0030470Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.0031531Z context = 2025-05-07T20:32:00.0031820Z 2025-05-07T20:32:00.0031989Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.0032506Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.0032970Z module_map=module_map) 2025-05-07T20:32:00.0033336Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.0033689Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:00.0033956Z E ^ 2025-05-07T20:32:00.0034415Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:00.0035279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
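The interleaved identify_mutated_tensors warnings come from torch.compile tracing the user Triton kernel: it compiles the kernel to TTIR to work out which arguments are written to, that compilation raises the same ValueError, and it falls back to assuming every input is mutated before the test failure itself surfaces. The failure reproduces without Hypothesis or torch.compile; a minimal sketch (a hypothetical kernel, assuming any cast to tl.float8e4nv trips the error during ast_to_ttir on such GPUs):

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_fp8e4nv_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # The cast below is what needs SM 8.9+; on older GPUs compilation fails
    # with: ValueError("type fp8e4nv not supported in this architecture.
    # The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

n = 1024
x = torch.randn(n, device="cuda", dtype=torch.float32)
y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
# Expected to raise triton.compiler.errors.CompilationError on SM < 8.9.
_cast_fp8e4nv_kernel[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)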
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:00.3371939Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:00.3372772Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:00.3373797Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:32:00.3374815Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:00.3375615Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:00.3376821Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:00.3378103Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:00.3379231Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:32:00.3380270Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:00.3381504Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:00.3382940Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:00.3384004Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.3384922Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.3385670Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:00.3386681Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.4202061Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:00.4203195Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:00.4204524Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:00.4205935Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:00.4206909Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:00.4208209Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:00.4209580Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.4210559Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:00.4211831Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:00.4213211Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.4214267Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:00.4215541Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:00.4216787Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:00.4218003Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:00.4219398Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:00.4220227Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:00.4221247Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:32:00.4222269Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:00.4223064Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:00.4224410Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:00.4225682Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:00.4226795Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:32:00.4227837Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:00.4229228Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:00.4230606Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:00.4231708Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.4232623Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.4233370Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:00.4234384Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:00.7727301Z self = <...>
2025-05-07T20:32:00.7728073Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:00.7728757Z     @given(
2025-05-07T20:32:00.7729076Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:00.7729403Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:00.7729710Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:00.7730051Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:00.7730384Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:00.7730669Z     )
2025-05-07T20:32:00.7731020Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:00.7731463Z     def test_silu_mul_quant(
2025-05-07T20:32:00.7731702Z         self,
2025-05-07T20:32:00.7731901Z         T: int,
2025-05-07T20:32:00.7732103Z         D: int,
2025-05-07T20:32:00.7732323Z         scale_ub: Optional[float],
2025-05-07T20:32:00.7732600Z         contiguous: bool,
2025-05-07T20:32:00.7732838Z         compiled: bool,
2025-05-07T20:32:00.7733058Z     ) -> None:
2025-05-07T20:32:00.7733282Z         torch.manual_seed(2025)
2025-05-07T20:32:00.7733972Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:00.7734516Z         x_sign = torch.sign(x)
2025-05-07T20:32:00.7734810Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:00.7735113Z         x = x_sign * x_clamp
2025-05-07T20:32:00.7735350Z         x0 = x[:, :D]
2025-05-07T20:32:00.7735569Z         x1 = x[:, D:]
2025-05-07T20:32:00.7735952Z         if contiguous:
2025-05-07T20:32:00.7736190Z             x0 = x0.contiguous()
2025-05-07T20:32:00.7736441Z             x1 = x1.contiguous()
2025-05-07T20:32:00.7736869Z         if scale_ub is not None:
2025-05-07T20:32:00.7737251Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:00.7737587Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:00.7737897Z             )
2025-05-07T20:32:00.7738088Z         else:
2025-05-07T20:32:00.7738309Z             scale_ub_tensor = None
2025-05-07T20:32:00.7738790Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.7739097Z             op = silu_mul_quant
2025-05-07T20:32:00.7739351Z             if compiled:
2025-05-07T20:32:00.7739600Z                 op = torch.compile(op)
2025-05-07T20:32:00.7739889Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:00.7740361Z         y_fp8, y_scale = fn()
2025-05-07T20:32:00.7740721Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:00.7741542Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.7742070Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:00.7742522Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:00.7743008Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:00.7743575Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:00.7744404Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:00.7744831Z moe/activation_test.py:126:
2025-05-07T20:32:00.7745242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:00.7745646Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:00.7745972Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:00.7746758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:00.7747507Z     _kernel_quantize_fp8_row[grid](
[... same jit.py -> autotuner.py -> testing.py -> compiler.py chain as above ...]
2025-05-07T20:32:00.7765097Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:00.7765446Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:00.7765718Z E   ^
2025-05-07T20:32:00.7766183Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
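For reference, the contract of triton_quantize_fp8_row as exercised here is row-wise quantization: scale each row so its max magnitude fits the fp8 range, and return the per-row dequantization scale (the test reconstructs y as y_fp8.to(torch.float32) * y_scale[:, None]). A plain-PyTorch sketch of that contract, assuming torch.float8_e4m3fn and ignoring the real kernel's tiling and autotuning:

import torch

def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Per-row max magnitude, optionally capped by scale_ub as in the test.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    y_scale = row_max / fp8_max                      # dequantization scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale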
2025-05-07T20:32:00.7767043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:00.7767657Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.7768069Z     self=<...>,
2025-05-07T20:32:00.7768472Z     T=16384,
2025-05-07T20:32:00.7768664Z     D=5120,
2025-05-07T20:32:00.7768866Z     scale_ub=None,
2025-05-07T20:32:00.7769091Z     contiguous=True,
2025-05-07T20:32:00.7769318Z     compiled=True,
2025-05-07T20:32:00.7769524Z )
2025-05-07T20:32:00.8021305Z W0507 20:32:00.801000 238389 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:00.8023050Z     function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:00.8024441Z     last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:00.8025441Z     To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:00.8026557Z     To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:32:00.8707763Z self = <...>
2025-05-07T20:32:00.8709316Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... same test source and ref_fn() traceback as the T=4096 example above; fails at moe/activation_test.py:126 with the same CompilationError from _kernel_quantize_fp8_row ...]
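The recompile-limit warning is separate from the fp8 error: Hypothesis feeds both contiguous and sliced inputs (x0 stride 10240 vs 5120), and every layout change invalidates a dynamo guard until the limit of 8 is reached and silu_mul_quant falls back to eager. A sketch of the two usual remedies; whether either is appropriate for this test is an assumption:

import torch

# 1) Compile once with dynamic shapes/strides so one graph serves both layouts:
#    op = torch.compile(silu_mul_quant, dynamic=True)

# 2) Or raise the limit for a test that legitimately varies layout this much:
torch._dynamo.config.recompile_limit = 32

# TORCH_LOGS="recompiles" (suggested in the log) prints every guard failure.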
2025-05-07T20:32:00.8752593Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): fails at moe/activation_test.py:117 in fn(); _fbgemm_silu_mul_quant raises the same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:00.9845002Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True): fails at moe/activation_test.py:126 in ref_fn(); _kernel_quantize_fp8_row raises the same CompilationError
2025-05-07T20:32:01.2008715Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False): fails at moe/activation_test.py:117 in fn(); _fbgemm_silu_mul_quant raises the same CompilationError
2025-05-07T20:32:01.3211539Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True): fails at moe/activation_test.py:117 in fn(); _fbgemm_silu_mul_quant raises the same CompilationError
2025-05-07T20:32:01.3248933Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): fails at moe/activation_test.py:117 in fn(); _fbgemm_silu_mul_quant raises the same CompilationError
2025-05-07T20:32:01.4146641Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False): fails at moe/activation_test.py:117 in fn(); _fbgemm_silu_mul_quant raises the same CompilationError
2025-05-07T20:32:01.4177376Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): fails at moe/activation_test.py:117 in fn(); _fbgemm_silu_mul_quant raises the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.5562259Z 2025-05-07T20:32:01.5562675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.5563196Z 2025-05-07T20:32:01.5563305Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.5563725Z self=, 2025-05-07T20:32:01.5564125Z T=1, 2025-05-07T20:32:01.5564322Z D=7168, 2025-05-07T20:32:01.5564525Z scale_ub=1200.0, 2025-05-07T20:32:01.5564749Z contiguous=True, 2025-05-07T20:32:01.5564982Z compiled=True, 2025-05-07T20:32:01.5565198Z ) 2025-05-07T20:32:01.5565523Z self = 2025-05-07T20:32:01.5566008Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.5566276Z 2025-05-07T20:32:01.5566358Z @given( 2025-05-07T20:32:01.5566610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.5566923Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.5567241Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.5567584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.5567910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.5568199Z ) 2025-05-07T20:32:01.5568554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.5568999Z def test_silu_mul_quant( 2025-05-07T20:32:01.5569240Z self, 2025-05-07T20:32:01.5569442Z T: int, 2025-05-07T20:32:01.5569648Z D: int, 2025-05-07T20:32:01.5569865Z scale_ub: Optional[float], 2025-05-07T20:32:01.5570141Z contiguous: bool, 2025-05-07T20:32:01.5570387Z compiled: bool, 2025-05-07T20:32:01.5570606Z ) -> None: 2025-05-07T20:32:01.5570826Z torch.manual_seed(2025) 2025-05-07T20:32:01.5571080Z 2025-05-07T20:32:01.5571386Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.5571750Z 2025-05-07T20:32:01.5571949Z x_sign = torch.sign(x) 2025-05-07T20:32:01.5572322Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.5572640Z x = x_sign * x_clamp 2025-05-07T20:32:01.5572885Z x0 = x[:, :D] 2025-05-07T20:32:01.5573100Z x1 = x[:, D:] 2025-05-07T20:32:01.5573307Z 2025-05-07T20:32:01.5573501Z if contiguous: 2025-05-07T20:32:01.5573729Z x0 = x0.contiguous() 2025-05-07T20:32:01.5573992Z x1 = x1.contiguous() 2025-05-07T20:32:01.5574238Z 2025-05-07T20:32:01.5574431Z if scale_ub is not None: 2025-05-07T20:32:01.5574700Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.5575037Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.5575349Z ) 2025-05-07T20:32:01.5575591Z else: 2025-05-07T20:32:01.5584576Z scale_ub_tensor = None 2025-05-07T20:32:01.5584886Z 2025-05-07T20:32:01.5585143Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.5585468Z op = silu_mul_quant 2025-05-07T20:32:01.5585731Z if compiled: 2025-05-07T20:32:01.5585999Z op = torch.compile(op) 2025-05-07T20:32:01.5586313Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.5586593Z 2025-05-07T20:32:01.5586802Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.5586972Z 2025-05-07T20:32:01.5587087Z moe/activation_test.py:117: 2025-05-07T20:32:01.5587384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.5587728Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.5588023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.5588585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.5589231Z return fn(*args, **kwargs) 
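Note on the recurring failure: Triton lowers torch.float8_e4m3fn to its fp8e4nv type, which is only implemented for NVIDIA GPUs of compute capability 8.9 and newer (Ada/Hopper). The A10G on this linux.g5.4xlarge runner is SM 8.6, where Triton exposes only fp8e4b15 and fp8e5, so every kernel touching the e4m3 dtype fails at compile time with the ValueError shown above. A minimal capability-guard sketch (gpu_supports_fp8_e4m3 is a hypothetical helper, not part of the test file):

    import unittest

    import torch

    def gpu_supports_fp8_e4m3() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) kernels need SM 8.9+ (Ada/Hopper);
        # the A10G on this runner reports (8, 6), hence the CompilationError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test method:
    # @unittest.skipIf(not gpu_supports_fp8_e4m3(), "fp8 e4m3 unsupported on this GPU")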
2025-05-07T20:32:01.5589904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.5590598Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.5591187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.5591874Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.5592540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.5593069Z kernel = self.compile( 2025-05-07T20:32:01.5593619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.5594281Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.5594695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.5594932Z 2025-05-07T20:32:01.5595139Z self = 2025-05-07T20:32:01.5596233Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.5597617Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4984e20ea0>} 2025-05-07T20:32:01.5598975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.5600002Z context = 2025-05-07T20:32:01.5600290Z 2025-05-07T20:32:01.5600460Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.5601012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.5601623Z module_map=module_map) 2025-05-07T20:32:01.5601993Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.5602354Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.5602621Z E ^ 2025-05-07T20:32:01.5603092Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.5603540Z 2025-05-07T20:32:01.5603955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.5604476Z 2025-05-07T20:32:01.5604597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.5605019Z self=, 2025-05-07T20:32:01.5605514Z T=1, 2025-05-07T20:32:01.5605704Z D=7168, 2025-05-07T20:32:01.5605908Z scale_ub=1200.0, 2025-05-07T20:32:01.5606146Z contiguous=False, 2025-05-07T20:32:01.5606374Z compiled=True, 2025-05-07T20:32:01.5606588Z ) 2025-05-07T20:32:01.8290797Z self = 2025-05-07T20:32:01.8291310Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.8291889Z 2025-05-07T20:32:01.8292139Z @given( 2025-05-07T20:32:01.8292706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.8293339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.8293942Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.8294607Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.8295268Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.8295833Z ) 2025-05-07T20:32:01.8296526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.8297411Z def test_silu_mul_quant( 2025-05-07T20:32:01.8297899Z self, 2025-05-07T20:32:01.8298283Z T: int, 2025-05-07T20:32:01.8298676Z D: int, 2025-05-07T20:32:01.8299123Z scale_ub: Optional[float], 2025-05-07T20:32:01.8299654Z contiguous: bool, 2025-05-07T20:32:01.8300133Z compiled: bool, 2025-05-07T20:32:01.8300586Z ) -> None: 2025-05-07T20:32:01.8300979Z torch.manual_seed(2025) 2025-05-07T20:32:01.8301228Z 2025-05-07T20:32:01.8301514Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.8301858Z 2025-05-07T20:32:01.8302060Z x_sign = torch.sign(x) 2025-05-07T20:32:01.8302360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.8302669Z x = x_sign * x_clamp 2025-05-07T20:32:01.8302916Z x0 = x[:, :D] 2025-05-07T20:32:01.8303151Z x1 = x[:, D:] 2025-05-07T20:32:01.8303371Z 2025-05-07T20:32:01.8303561Z if contiguous: 2025-05-07T20:32:01.8303806Z x0 = x0.contiguous() 2025-05-07T20:32:01.8304071Z x1 = x1.contiguous() 2025-05-07T20:32:01.8304310Z 2025-05-07T20:32:01.8304510Z if scale_ub is not None: 2025-05-07T20:32:01.8304792Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.8305126Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.8305444Z ) 2025-05-07T20:32:01.8305648Z else: 2025-05-07T20:32:01.8305856Z scale_ub_tensor = None 2025-05-07T20:32:01.8306115Z 2025-05-07T20:32:01.8306358Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.8306670Z op = silu_mul_quant 2025-05-07T20:32:01.8306928Z if compiled: 2025-05-07T20:32:01.8307184Z op = torch.compile(op) 2025-05-07T20:32:01.8307481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.8307763Z 2025-05-07T20:32:01.8307970Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.8308137Z 2025-05-07T20:32:01.8308252Z moe/activation_test.py:117: 2025-05-07T20:32:01.8308547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.8309204Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.8309502Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.8310069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.8310631Z return fn(*args, **kwargs) 
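For reference, the op under test computes a SiLU-gated product, y = silu(x0) * x1 = x0 * sigmoid(x0) * x1, fused with rowwise fp8 quantization of the result. A plain-PyTorch sketch of the unquantized part, mirroring the test's own ref_fn (the quantization step is where compilation fails on this GPU):

    import torch

    def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in fp32, exactly as ref_fn computes before quantizing.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32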
2025-05-07T20:32:01.8311299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.8311981Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.8312525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.8313297Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.8314028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.8314559Z kernel = self.compile( 2025-05-07T20:32:01.8315110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.8315769Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.8316163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.8316396Z 2025-05-07T20:32:01.8316602Z self = 2025-05-07T20:32:01.8317686Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.8319063Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899e4d1c0>} 2025-05-07T20:32:01.8320406Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.8321462Z context = 2025-05-07T20:32:01.8321766Z 2025-05-07T20:32:01.8321936Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.8322452Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.8322917Z module_map=module_map) 2025-05-07T20:32:01.8323277Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.8323638Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.8323910Z E ^ 2025-05-07T20:32:01.8324370Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.8324824Z 2025-05-07T20:32:01.8325245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.8325763Z 2025-05-07T20:32:01.8325867Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.8326284Z self=, 2025-05-07T20:32:01.8326683Z T=1, 2025-05-07T20:32:01.8326877Z D=7168, 2025-05-07T20:32:01.8327077Z scale_ub=None, 2025-05-07T20:32:01.8327295Z contiguous=False, 2025-05-07T20:32:01.8327523Z compiled=True, 2025-05-07T20:32:01.8327734Z ) 2025-05-07T20:32:01.8996605Z self = 2025-05-07T20:32:01.8997105Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.8997436Z 2025-05-07T20:32:01.8997518Z @given( 2025-05-07T20:32:01.8998222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.8999126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.8999958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9001093Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9001578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9001886Z ) 2025-05-07T20:32:01.9002241Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9002689Z def test_silu_mul_quant( 2025-05-07T20:32:01.9002937Z self, 2025-05-07T20:32:01.9003136Z T: int, 2025-05-07T20:32:01.9003344Z D: int, 2025-05-07T20:32:01.9003573Z scale_ub: Optional[float], 2025-05-07T20:32:01.9003844Z contiguous: bool, 2025-05-07T20:32:01.9004096Z compiled: bool, 2025-05-07T20:32:01.9004397Z ) -> None: 2025-05-07T20:32:01.9004683Z torch.manual_seed(2025) 2025-05-07T20:32:01.9004937Z 2025-05-07T20:32:01.9005226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9005567Z 2025-05-07T20:32:01.9005772Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9006077Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9006387Z x = x_sign * x_clamp 2025-05-07T20:32:01.9006635Z x0 = x[:, :D] 2025-05-07T20:32:01.9006856Z x1 = x[:, D:] 2025-05-07T20:32:01.9007064Z 2025-05-07T20:32:01.9007259Z if contiguous: 2025-05-07T20:32:01.9007491Z x0 = x0.contiguous() 2025-05-07T20:32:01.9007744Z x1 = x1.contiguous() 2025-05-07T20:32:01.9007985Z 2025-05-07T20:32:01.9008179Z if scale_ub is not None: 2025-05-07T20:32:01.9008451Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9008801Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9009108Z ) 2025-05-07T20:32:01.9009314Z else: 2025-05-07T20:32:01.9009530Z scale_ub_tensor = None 2025-05-07T20:32:01.9009777Z 2025-05-07T20:32:01.9010015Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9010332Z op = silu_mul_quant 2025-05-07T20:32:01.9010584Z if compiled: 2025-05-07T20:32:01.9010836Z op = torch.compile(op) 2025-05-07T20:32:01.9011139Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9011408Z 2025-05-07T20:32:01.9011607Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.9011897Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.9012206Z 2025-05-07T20:32:01.9012443Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9012784Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.9013084Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.9013392Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.9013757Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.9014070Z 2025-05-07T20:32:01.9014266Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:01.9014467Z 2025-05-07T20:32:01.9014569Z moe/activation_test.py:126: 2025-05-07T20:32:01.9014874Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9015216Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.9015538Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.9016328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.9017087Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.9017627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9018309Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9018997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.9019800Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9020541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:01.9021314Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9022062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.9022699Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.9023292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.9023812Z fn() 2025-05-07T20:32:01.9024366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.9024977Z self.fn.run( 2025-05-07T20:32:01.9025441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9025973Z kernel = self.compile( 2025-05-07T20:32:01.9026516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9027160Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9027565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9027790Z 2025-05-07T20:32:01.9028003Z self = 2025-05-07T20:32:01.9029472Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9030838Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f4899e4dda0>} 2025-05-07T20:32:01.9032231Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9033251Z context = 2025-05-07T20:32:01.9033535Z 2025-05-07T20:32:01.9033708Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9034223Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9034690Z module_map=module_map) 2025-05-07T20:32:01.9035057Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9035417Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.9035680Z E ^ 2025-05-07T20:32:01.9036144Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9036593Z 2025-05-07T20:32:01.9037012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9037521Z 2025-05-07T20:32:01.9037635Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9038040Z self=, 2025-05-07T20:32:01.9038444Z T=1, 2025-05-07T20:32:01.9038632Z D=5120, 2025-05-07T20:32:01.9038824Z scale_ub=1200.0, 2025-05-07T20:32:01.9039055Z contiguous=False, 2025-05-07T20:32:01.9039286Z compiled=True, 2025-05-07T20:32:01.9039486Z ) 2025-05-07T20:32:02.0227476Z self = 2025-05-07T20:32:02.0228421Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:02.0228799Z 2025-05-07T20:32:02.0228923Z @given( 2025-05-07T20:32:02.0229292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.0229942Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.0230381Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.0230803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.0231130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.0231415Z ) 2025-05-07T20:32:02.0231768Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.0232204Z def test_silu_mul_quant( 2025-05-07T20:32:02.0232456Z self, 2025-05-07T20:32:02.0232654Z T: int, 2025-05-07T20:32:02.0232850Z D: int, 2025-05-07T20:32:02.0233075Z scale_ub: Optional[float], 2025-05-07T20:32:02.0233429Z contiguous: bool, 2025-05-07T20:32:02.0233724Z compiled: bool, 2025-05-07T20:32:02.0233952Z ) -> None: 2025-05-07T20:32:02.0234171Z torch.manual_seed(2025) 2025-05-07T20:32:02.0234407Z 2025-05-07T20:32:02.0234700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.0235050Z 2025-05-07T20:32:02.0235248Z x_sign = torch.sign(x) 2025-05-07T20:32:02.0235541Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.0235856Z x = x_sign * x_clamp 2025-05-07T20:32:02.0236103Z x0 = x[:, :D] 2025-05-07T20:32:02.0236324Z x1 = x[:, D:] 2025-05-07T20:32:02.0236536Z 2025-05-07T20:32:02.0236726Z if contiguous: 2025-05-07T20:32:02.0236957Z x0 = x0.contiguous() 2025-05-07T20:32:02.0237221Z x1 = x1.contiguous() 2025-05-07T20:32:02.0237472Z 2025-05-07T20:32:02.0237664Z if scale_ub is not None: 2025-05-07T20:32:02.0237940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.0238288Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.0238593Z ) 2025-05-07T20:32:02.0238800Z else: 2025-05-07T20:32:02.0239017Z scale_ub_tensor = None 2025-05-07T20:32:02.0239267Z 2025-05-07T20:32:02.0239508Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.0239826Z op = silu_mul_quant 2025-05-07T20:32:02.0240075Z if compiled: 
2025-05-07T20:32:02.0240326Z op = torch.compile(op) 2025-05-07T20:32:02.0240633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0240910Z 2025-05-07T20:32:02.0241103Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.0241276Z 2025-05-07T20:32:02.0241377Z moe/activation_test.py:117: 2025-05-07T20:32:02.0241675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.0242003Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.0242288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0242852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.0243411Z return fn(*args, **kwargs) 2025-05-07T20:32:02.0244070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.0244758Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.0245293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.0245965Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.0246627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.0247153Z kernel = self.compile( 2025-05-07T20:32:02.0247689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.0248337Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.0248739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.0248966Z 2025-05-07T20:32:02.0249295Z self = 2025-05-07T20:32:02.0250368Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.0251721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899e4e020>} 2025-05-07T20:32:02.0253054Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.0254176Z context = 2025-05-07T20:32:02.0254460Z 2025-05-07T20:32:02.0254629Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.0255145Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.0255610Z module_map=module_map) 2025-05-07T20:32:02.0255973Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.0256323Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.0256576Z E ^ 2025-05-07T20:32:02.0257039Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.0257485Z 2025-05-07T20:32:02.0257912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.0258424Z 2025-05-07T20:32:02.0258527Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.0258972Z self=, 2025-05-07T20:32:02.0259381Z T=1, 2025-05-07T20:32:02.0259571Z D=5120, 2025-05-07T20:32:02.0259772Z scale_ub=1200.0, 2025-05-07T20:32:02.0259992Z contiguous=False, 2025-05-07T20:32:02.0260223Z compiled=False, 2025-05-07T20:32:02.0260430Z ) 2025-05-07T20:32:02.0260746Z self = 2025-05-07T20:32:02.0261240Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:02.0261512Z 2025-05-07T20:32:02.0261619Z @given( 2025-05-07T20:32:02.0261870Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.0262182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.0262491Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.0262815Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.0263143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.0263428Z ) 2025-05-07T20:32:02.0263777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.0264213Z def test_silu_mul_quant( 2025-05-07T20:32:02.0264455Z self, 2025-05-07T20:32:02.0264652Z T: int, 2025-05-07T20:32:02.0264842Z D: int, 2025-05-07T20:32:02.0265060Z scale_ub: Optional[float], 2025-05-07T20:32:02.0265330Z contiguous: bool, 2025-05-07T20:32:02.0265564Z compiled: bool, 2025-05-07T20:32:02.0265785Z ) -> None: 2025-05-07T20:32:02.0265999Z torch.manual_seed(2025) 2025-05-07T20:32:02.0266232Z 2025-05-07T20:32:02.0266502Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.0266840Z 2025-05-07T20:32:02.0267027Z x_sign = torch.sign(x) 2025-05-07T20:32:02.0267317Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.0267626Z x = x_sign * x_clamp 2025-05-07T20:32:02.0267860Z x0 = x[:, :D] 2025-05-07T20:32:02.0268081Z x1 = x[:, D:] 2025-05-07T20:32:02.0268295Z 2025-05-07T20:32:02.0268474Z if contiguous: 2025-05-07T20:32:02.0268791Z x0 = x0.contiguous() 2025-05-07T20:32:02.0269050Z x1 = x1.contiguous() 2025-05-07T20:32:02.0269364Z 2025-05-07T20:32:02.0269550Z if scale_ub is not None: 2025-05-07T20:32:02.0269823Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.0270156Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.0270458Z ) 2025-05-07T20:32:02.0270652Z else: 2025-05-07T20:32:02.0270860Z scale_ub_tensor = None 2025-05-07T20:32:02.0271107Z 2025-05-07T20:32:02.0271359Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.0271706Z op = silu_mul_quant 2025-05-07T20:32:02.0272068Z if compiled: 2025-05-07T20:32:02.0272446Z op = torch.compile(op) 2025-05-07T20:32:02.0272986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0273365Z 2025-05-07T20:32:02.0273669Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.0273845Z 2025-05-07T20:32:02.0274035Z moe/activation_test.py:117: 2025-05-07T20:32:02.0274507Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.0274946Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.0275359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0276092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.0276881Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.0277554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.0278282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.0287575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.0288214Z kernel = self.compile( 2025-05-07T20:32:02.0288793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.0289464Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.0289873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.0290105Z 2025-05-07T20:32:02.0290313Z self = 2025-05-07T20:32:02.0291415Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.0292807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898fbc720>} 2025-05-07T20:32:02.0294175Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.0295196Z context = 2025-05-07T20:32:02.0295492Z 2025-05-07T20:32:02.0295659Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.0296185Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.0296660Z module_map=module_map) 2025-05-07T20:32:02.0297024Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.0297384Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.0297653Z E ^ 2025-05-07T20:32:02.0298124Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.0298587Z 2025-05-07T20:32:02.0299125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.0299653Z 2025-05-07T20:32:02.0299765Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.0300188Z self=, 2025-05-07T20:32:02.0300590Z T=16384, 2025-05-07T20:32:02.0300793Z D=5120, 2025-05-07T20:32:02.0301005Z scale_ub=1200.0, 2025-05-07T20:32:02.0301274Z contiguous=False, 2025-05-07T20:32:02.0301517Z compiled=True, 2025-05-07T20:32:02.0301727Z ) 2025-05-07T20:32:02.0983531Z self = 2025-05-07T20:32:02.0984108Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:02.0984615Z 2025-05-07T20:32:02.0984774Z @given( 2025-05-07T20:32:02.0985018Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.0985341Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.0985650Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.0985996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.0986332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.0986616Z ) 2025-05-07T20:32:02.0986978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.0987439Z def test_silu_mul_quant( 2025-05-07T20:32:02.0987679Z self, 2025-05-07T20:32:02.0987885Z T: int, 2025-05-07T20:32:02.0988092Z D: int, 2025-05-07T20:32:02.0988323Z scale_ub: Optional[float], 2025-05-07T20:32:02.0988595Z contiguous: bool, 2025-05-07T20:32:02.0988828Z compiled: bool, 2025-05-07T20:32:02.0989136Z ) -> None: 2025-05-07T20:32:02.0989364Z torch.manual_seed(2025) 2025-05-07T20:32:02.0989614Z 2025-05-07T20:32:02.0989901Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.0990246Z 2025-05-07T20:32:02.0990453Z x_sign = torch.sign(x) 2025-05-07T20:32:02.0990755Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.0991077Z x = x_sign * x_clamp 2025-05-07T20:32:02.0991361Z x0 = x[:, :D] 2025-05-07T20:32:02.0991581Z x1 = x[:, D:] 2025-05-07T20:32:02.0991788Z 2025-05-07T20:32:02.0991981Z if contiguous: 2025-05-07T20:32:02.0992218Z x0 = x0.contiguous() 2025-05-07T20:32:02.0992473Z x1 = x1.contiguous() 2025-05-07T20:32:02.0992720Z 2025-05-07T20:32:02.0992921Z if scale_ub is not None: 2025-05-07T20:32:02.0993190Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.0993536Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.0993850Z ) 2025-05-07T20:32:02.0994055Z else: 2025-05-07T20:32:02.0994266Z scale_ub_tensor = None 2025-05-07T20:32:02.0994524Z 2025-05-07T20:32:02.0994761Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.0995072Z op = silu_mul_quant 2025-05-07T20:32:02.0995327Z if compiled: 2025-05-07T20:32:02.0995577Z op = torch.compile(op) 2025-05-07T20:32:02.0995868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0996153Z 2025-05-07T20:32:02.0996350Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.0996517Z 2025-05-07T20:32:02.0996619Z moe/activation_test.py:117: 2025-05-07T20:32:02.0996917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.0997249Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.0997532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0998084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.0998650Z return fn(*args, **kwargs) 
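The (y_fp8, y_scale) pair is rowwise-quantized: the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], i.e. one fp32 scale per row. A hedged emulation of what triton_quantize_fp8_row appears to compute, inferred from that dequantization and the scale_ub argument (the real kernel's epsilon handling and rounding may differ):

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row dequantization scale: y ~= y_fp8.float() * scale[:, None].
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        scale = row_max / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale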
2025-05-07T20:32:02.0999311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.1000120Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.1000659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.1001338Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.1002043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.1002571Z kernel = self.compile( 2025-05-07T20:32:02.1003121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.1003781Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.1004262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1004498Z 2025-05-07T20:32:02.1004706Z self = 2025-05-07T20:32:02.1005791Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.1007160Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898fbdd00>} 2025-05-07T20:32:02.1008495Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.1009507Z context = 2025-05-07T20:32:02.1009803Z 2025-05-07T20:32:02.1009971Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.1010492Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.1010963Z module_map=module_map) 2025-05-07T20:32:02.1011332Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.1011689Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.1011955Z E ^ 2025-05-07T20:32:02.1012413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.1012868Z 2025-05-07T20:32:02.1013283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.1013798Z 2025-05-07T20:32:02.1013904Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.1014320Z self=, 2025-05-07T20:32:02.1014719Z T=2048, 2025-05-07T20:32:02.1014914Z D=7168, 2025-05-07T20:32:02.1015117Z scale_ub=1200.0, 2025-05-07T20:32:02.1015342Z contiguous=False, 2025-05-07T20:32:02.1015592Z compiled=True, 2025-05-07T20:32:02.1015805Z ) 2025-05-07T20:32:02.1016127Z self = 2025-05-07T20:32:02.1016617Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:02.1016891Z 2025-05-07T20:32:02.1016971Z @given( 2025-05-07T20:32:02.1017205Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.1017518Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.1017819Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.1018148Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.1018473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.1018756Z ) 2025-05-07T20:32:02.1019109Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.1019547Z def test_silu_mul_quant( 2025-05-07T20:32:02.1019784Z self, 2025-05-07T20:32:02.1019986Z T: int, 2025-05-07T20:32:02.1020275Z D: int, 2025-05-07T20:32:02.1020498Z scale_ub: Optional[float], 2025-05-07T20:32:02.1020769Z contiguous: bool, 2025-05-07T20:32:02.1021012Z compiled: bool, 2025-05-07T20:32:02.1021244Z ) -> None: 2025-05-07T20:32:02.1021461Z torch.manual_seed(2025) 2025-05-07T20:32:02.1021704Z 2025-05-07T20:32:02.1021980Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.1022318Z 2025-05-07T20:32:02.1022517Z x_sign = torch.sign(x) 2025-05-07T20:32:02.1022810Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.1023116Z x = x_sign * x_clamp 2025-05-07T20:32:02.1023401Z x0 = x[:, :D] 2025-05-07T20:32:02.1023688Z x1 = x[:, D:] 2025-05-07T20:32:02.1023890Z 2025-05-07T20:32:02.1024080Z if contiguous: 2025-05-07T20:32:02.1024314Z x0 = x0.contiguous() 2025-05-07T20:32:02.1024569Z x1 = x1.contiguous() 2025-05-07T20:32:02.1024810Z 2025-05-07T20:32:02.1025014Z if scale_ub is not None: 2025-05-07T20:32:02.1025282Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.1025621Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.1025927Z ) 2025-05-07T20:32:02.1026122Z else: 2025-05-07T20:32:02.1026339Z scale_ub_tensor = None 2025-05-07T20:32:02.1026590Z 2025-05-07T20:32:02.1026825Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.1027136Z op = silu_mul_quant 2025-05-07T20:32:02.1027390Z if compiled: 2025-05-07T20:32:02.1027639Z op = torch.compile(op) 2025-05-07T20:32:02.1027934Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.1028445Z 2025-05-07T20:32:02.1028669Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.1028835Z 2025-05-07T20:32:02.1028934Z moe/activation_test.py:117: 2025-05-07T20:32:02.1029272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1029602Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.1029879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.1030435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.1030989Z return fn(*args, **kwargs) 
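In the ref_fn failure further up, the error surfaces through triton.runtime.autotuner rather than a direct launch: _kernel_quantize_fp8_row appears to be autotuned, so Autotuner.run benchmarks every pruned config via do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)), and each benchmark run JIT-compiles the kernel, which is where the dtype check throws. A minimal sketch of that benchmarking call (requires a CUDA device; kernel_call is a stand-in):

    import triton.testing

    def kernel_call() -> None:
        # Stand-in for Autotuner.kernel_call, which invokes self.fn.run(...)
        # and therefore compiles the kernel inside the benchmark itself.
        pass

    # Median / p20 / p80 timings, matching autotuner.py:166 in the traceback:
    t_med, t_p20, t_p80 = triton.testing.do_bench(
        kernel_call, quantiles=(0.5, 0.2, 0.8)
    )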
2025-05-07T20:32:02.1031640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.1032320Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.1032851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.1033531Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.1034186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.1034709Z kernel = self.compile( 2025-05-07T20:32:02.1035249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.1035900Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.1036287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1036519Z 2025-05-07T20:32:02.1036724Z self = 2025-05-07T20:32:02.1037797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.1039155Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898fbe840>} 2025-05-07T20:32:02.1040622Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.1041645Z context = 2025-05-07T20:32:02.1041929Z 2025-05-07T20:32:02.1042093Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.1042609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.1043077Z module_map=module_map) 2025-05-07T20:32:02.1043444Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.1043851Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.1044173Z E ^ 2025-05-07T20:32:02.1044635Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.1045080Z 2025-05-07T20:32:02.1045498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.1046012Z 2025-05-07T20:32:02.1939765Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.1940246Z self=, 2025-05-07T20:32:02.1940787Z T=1, 2025-05-07T20:32:02.1940972Z D=5120, 2025-05-07T20:32:02.1941234Z scale_ub=None, 2025-05-07T20:32:02.1941539Z contiguous=False, 2025-05-07T20:32:02.1941846Z compiled=False, 2025-05-07T20:32:02.1942125Z ) 2025-05-07T20:32:02.1942518Z self = 2025-05-07T20:32:02.1943016Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:02.1943278Z 2025-05-07T20:32:02.1943357Z @given( 2025-05-07T20:32:02.1943591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.1943905Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.1944211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.1944540Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.1944873Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.1945152Z ) 2025-05-07T20:32:02.1945499Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.1945935Z def test_silu_mul_quant( 2025-05-07T20:32:02.1946179Z self, 2025-05-07T20:32:02.1946370Z T: int, 2025-05-07T20:32:02.1946574Z D: int, 2025-05-07T20:32:02.1946792Z scale_ub: Optional[float], 2025-05-07T20:32:02.1947055Z contiguous: bool, 2025-05-07T20:32:02.1947294Z compiled: bool, 2025-05-07T20:32:02.1947521Z ) -> None: 2025-05-07T20:32:02.1947731Z torch.manual_seed(2025) 2025-05-07T20:32:02.1947970Z 2025-05-07T20:32:02.1948243Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.1948574Z 2025-05-07T20:32:02.1948781Z x_sign = torch.sign(x) 2025-05-07T20:32:02.1949135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.1949440Z x = x_sign * x_clamp 2025-05-07T20:32:02.1949675Z x0 = x[:, :D] 2025-05-07T20:32:02.1949898Z x1 = x[:, D:] 2025-05-07T20:32:02.1950103Z 2025-05-07T20:32:02.1950289Z if contiguous: 2025-05-07T20:32:02.1950521Z x0 = x0.contiguous() 2025-05-07T20:32:02.1950778Z x1 = x1.contiguous() 2025-05-07T20:32:02.1951009Z 2025-05-07T20:32:02.1951202Z if scale_ub is not None: 2025-05-07T20:32:02.1951474Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.1951803Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.1952117Z ) 2025-05-07T20:32:02.1952308Z else: 2025-05-07T20:32:02.1952513Z scale_ub_tensor = None 2025-05-07T20:32:02.1952764Z 2025-05-07T20:32:02.1952993Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.1953501Z op = silu_mul_quant 2025-05-07T20:32:02.1953762Z if compiled: 2025-05-07T20:32:02.1954008Z op = torch.compile(op) 2025-05-07T20:32:02.1954307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.1954577Z 2025-05-07T20:32:02.1954766Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.1954932Z 2025-05-07T20:32:02.1955030Z moe/activation_test.py:117: 2025-05-07T20:32:02.1955325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1955658Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.1955933Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.1956676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.1957421Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.1957958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.1958630Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.1959284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.1959807Z kernel = self.compile( 2025-05-07T20:32:02.1960350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.1960999Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.1961395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1961622Z 2025-05-07T20:32:02.1961833Z self = 2025-05-07T20:32:02.1962913Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.1964262Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48991640e0>} 2025-05-07T20:32:02.1965591Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.1966611Z context = 2025-05-07T20:32:02.1966894Z 2025-05-07T20:32:02.1967058Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.1967581Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.1968044Z module_map=module_map) 2025-05-07T20:32:02.1968404Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.1968751Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.1969006Z E ^ 2025-05-07T20:32:02.1969464Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.1969912Z 2025-05-07T20:32:02.1970324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.1970836Z 2025-05-07T20:32:02.1970939Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.1971369Z self=, 2025-05-07T20:32:02.1971800Z T=4096, 2025-05-07T20:32:02.1971987Z D=7168, 2025-05-07T20:32:02.1972184Z scale_ub=1200.0, 2025-05-07T20:32:02.1972405Z contiguous=False, 2025-05-07T20:32:02.1972627Z compiled=False, 2025-05-07T20:32:02.1972832Z ) 2025-05-07T20:32:02.1973151Z self = 2025-05-07T20:32:02.1973770Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:02.1974051Z 2025-05-07T20:32:02.1974131Z @given( 2025-05-07T20:32:02.1974365Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.1974670Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.1974972Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.1975299Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.1975621Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.1975899Z ) 2025-05-07T20:32:02.1976243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.1976761Z def test_silu_mul_quant( 2025-05-07T20:32:02.1976998Z self, 2025-05-07T20:32:02.1977196Z T: int, 2025-05-07T20:32:02.1977390Z D: int, 2025-05-07T20:32:02.1977601Z scale_ub: Optional[float], 2025-05-07T20:32:02.1977871Z contiguous: bool, 2025-05-07T20:32:02.1978116Z compiled: bool, 2025-05-07T20:32:02.1978332Z ) -> None: 2025-05-07T20:32:02.1978548Z torch.manual_seed(2025) 2025-05-07T20:32:02.1978791Z 2025-05-07T20:32:02.1979054Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.1979392Z 2025-05-07T20:32:02.1979585Z x_sign = torch.sign(x) 2025-05-07T20:32:02.1979873Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.1980181Z x = x_sign * x_clamp 2025-05-07T20:32:02.1980420Z x0 = x[:, :D] 2025-05-07T20:32:02.1980633Z x1 = x[:, D:] 2025-05-07T20:32:02.1980833Z 2025-05-07T20:32:02.1981021Z if contiguous: 2025-05-07T20:32:02.1981252Z x0 = x0.contiguous() 2025-05-07T20:32:02.1981504Z x1 = x1.contiguous() 2025-05-07T20:32:02.1981738Z 2025-05-07T20:32:02.1981927Z if scale_ub is not None: 2025-05-07T20:32:02.1982192Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.1982526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.1982832Z ) 2025-05-07T20:32:02.1983021Z else: 2025-05-07T20:32:02.1983231Z scale_ub_tensor = None 2025-05-07T20:32:02.1983484Z 2025-05-07T20:32:02.1983708Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.1984026Z op = silu_mul_quant 2025-05-07T20:32:02.1984278Z if compiled: 2025-05-07T20:32:02.1984520Z op = torch.compile(op) 2025-05-07T20:32:02.1984809Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.1985080Z 2025-05-07T20:32:02.1985265Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.1985435Z 2025-05-07T20:32:02.1985536Z moe/activation_test.py:117: 2025-05-07T20:32:02.1985826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1986154Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.1986429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.1987123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:02.1987805Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.1988326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.1989005Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.1989731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.1990259Z kernel = self.compile( 2025-05-07T20:32:02.1990793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.1991446Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.1991926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1992153Z 2025-05-07T20:32:02.1992359Z self = 2025-05-07T20:32:02.1993421Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.1994779Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899165300>} 2025-05-07T20:32:02.1996102Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.1997231Z context = 2025-05-07T20:32:02.1997513Z 2025-05-07T20:32:02.1997686Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.1998198Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.1998666Z module_map=module_map) 2025-05-07T20:32:02.1999035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.1999377Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.1999642Z E ^ 2025-05-07T20:32:02.2000104Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:02.2000547Z E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.2000968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:02.2001528Z 
2025-05-07T20:32:02.2001631Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:02.2002034Z     self=,
2025-05-07T20:32:02.2002445Z     T=16384,
2025-05-07T20:32:02.2002638Z     D=7168,
2025-05-07T20:32:02.2002830Z     scale_ub=None,
2025-05-07T20:32:02.2003043Z     contiguous=True,
2025-05-07T20:32:02.2003257Z     compiled=True,
2025-05-07T20:32:02.2003457Z )
2025-05-07T20:32:02.5114580Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.5115459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
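Every example in this run fails identically: Triton refuses to lower _fbgemm_silu_mul_quant because the fp8e4nv (FP8 E4M3) dtype is not implemented for this GPU. The g5.4xlarge runner carries an NVIDIA A10G (compute capability 8.6), and the error indicates only fp8e4b15 and fp8e5 are available there; fp8e4nv is generally assumed to need SM 8.9 (Ada) or newer. Below is a minimal sketch, under those assumptions, of a capability gate that would skip rather than fail such tests; the helper name, the threshold, and the class are illustrative, not FBGEMM's actual CI logic.

    import unittest
    import torch

    def _has_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (CUDA e4m3) needs compute capability >= (8, 9),
        # i.e. Ada or Hopper. The A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):  # illustrative container
        @unittest.skipIf(not _has_fp8e4nv(), "fp8e4nv requires SM 8.9+")
        def test_silu_mul_quant(self) -> None:
            ...  # the Hypothesis-driven body shown in the log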
2025-05-07T20:32:02.5116083Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:02.5116494Z     self=,
2025-05-07T20:32:02.5116903Z     T=4096,
2025-05-07T20:32:02.5117103Z     D=5120,
2025-05-07T20:32:02.5117300Z     scale_ub=None,
2025-05-07T20:32:02.5117529Z     contiguous=False,
2025-05-07T20:32:02.5117806Z     compiled=True,
2025-05-07T20:32:02.5118053Z )
2025-05-07T20:32:02.5147731Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.5148607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:02.6289124Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:02.6289852Z     self=,
2025-05-07T20:32:02.6290414Z     T=4096,
2025-05-07T20:32:02.6290678Z     D=5120,
2025-05-07T20:32:02.6290918Z     scale_ub=1200.0,
2025-05-07T20:32:02.6291144Z     contiguous=False,
2025-05-07T20:32:02.6291377Z     compiled=False,
2025-05-07T20:32:02.6291595Z )
2025-05-07T20:32:02.6319884Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.6320758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:02.6321411Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:02.6321846Z     self=,
2025-05-07T20:32:02.6322256Z     T=4096,
2025-05-07T20:32:02.6322450Z     D=5120,
2025-05-07T20:32:02.6322643Z     scale_ub=1200.0,
2025-05-07T20:32:02.6322876Z     contiguous=False,
2025-05-07T20:32:02.6323106Z     compiled=True,
2025-05-07T20:32:02.6323311Z )
2025-05-07T20:32:02.6352579Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.6353446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:02.7234470Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:02.7235051Z     self=,
2025-05-07T20:32:02.7235675Z     T=2048,
2025-05-07T20:32:02.7235958Z     D=7168,
2025-05-07T20:32:02.7236427Z     scale_ub=1200.0,
2025-05-07T20:32:02.7236747Z     contiguous=False,
2025-05-07T20:32:02.7237061Z     compiled=False,
2025-05-07T20:32:02.7237295Z )
2025-05-07T20:32:02.7272757Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.7273642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
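From the test body alone, silu_mul_quant takes the two bfloat16 halves of x plus an optional float32 scale upper bound and returns a (y_fp8, y_scale) pair. A hedged PyTorch reference of that contract is sketched below; the row-wise scaling scheme, the E4M3 max of 448, and all names are inferred for illustration and are not FBGEMM's implementation.

    from typing import Optional, Tuple
    import torch

    FP8_MAX = 448.0  # finite max of float8_e4m3fn; an assumption of this sketch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # y = SiLU(x0) * x1, computed in fp32 for accuracy.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Per-row scale from the absolute max, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = row_max / FP8_MAX
        # Saturate before casting: float8_e4m3fn has no inf to absorb overflow.
        y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)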
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.7273205Z 2025-05-07T20:32:02.7273642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.7274155Z 2025-05-07T20:32:02.7274263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.7274690Z self=, 2025-05-07T20:32:02.7275101Z T=1, 2025-05-07T20:32:02.7275288Z D=7168, 2025-05-07T20:32:02.7275495Z scale_ub=None, 2025-05-07T20:32:02.7275716Z contiguous=True, 2025-05-07T20:32:02.7275938Z compiled=False, 2025-05-07T20:32:02.7276162Z ) 2025-05-07T20:32:02.7276498Z self = 2025-05-07T20:32:02.7276976Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:02.7277253Z 2025-05-07T20:32:02.7277334Z @given( 2025-05-07T20:32:02.7277578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.7277902Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.7278211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.7278569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.7278905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.7279199Z ) 2025-05-07T20:32:02.7279546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.7279992Z def test_silu_mul_quant( 2025-05-07T20:32:02.7280241Z self, 2025-05-07T20:32:02.7280437Z T: int, 2025-05-07T20:32:02.7280639Z D: int, 2025-05-07T20:32:02.7280869Z scale_ub: Optional[float], 2025-05-07T20:32:02.7281137Z contiguous: bool, 2025-05-07T20:32:02.7281403Z compiled: bool, 2025-05-07T20:32:02.7281668Z ) -> None: 2025-05-07T20:32:02.7281884Z torch.manual_seed(2025) 2025-05-07T20:32:02.7282130Z 2025-05-07T20:32:02.7282414Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.7282760Z 2025-05-07T20:32:02.7282958Z x_sign = torch.sign(x) 2025-05-07T20:32:02.7283358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.7283676Z x = x_sign * x_clamp 2025-05-07T20:32:02.7283922Z x0 = x[:, :D] 2025-05-07T20:32:02.7284150Z x1 = x[:, D:] 2025-05-07T20:32:02.7284369Z 2025-05-07T20:32:02.7284559Z if contiguous: 2025-05-07T20:32:02.7284798Z x0 = x0.contiguous() 2025-05-07T20:32:02.7285065Z x1 = x1.contiguous() 2025-05-07T20:32:02.7285304Z 2025-05-07T20:32:02.7285504Z if scale_ub is not None: 2025-05-07T20:32:02.7285784Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.7286124Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.7287866Z ) 2025-05-07T20:32:02.7288110Z else: 2025-05-07T20:32:02.7288318Z scale_ub_tensor = None 2025-05-07T20:32:02.7288571Z 2025-05-07T20:32:02.7288813Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.7289138Z op = silu_mul_quant 2025-05-07T20:32:02.7289393Z if compiled: 2025-05-07T20:32:02.7289647Z op = torch.compile(op) 2025-05-07T20:32:02.7289949Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.7290227Z 2025-05-07T20:32:02.7290436Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.7290603Z 2025-05-07T20:32:02.7290712Z moe/activation_test.py:117: 2025-05-07T20:32:02.7291008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.7291344Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.7291641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.7292361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.7293065Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.7293611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.7294306Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.7294977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.7295518Z kernel = self.compile( 2025-05-07T20:32:02.7296066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.7296729Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.7297125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.7297359Z 2025-05-07T20:32:02.7297575Z self = 2025-05-07T20:32:02.7298673Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.7300045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899015c60>} 2025-05-07T20:32:02.7301386Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.7302430Z context = 2025-05-07T20:32:02.7302727Z 2025-05-07T20:32:02.7302896Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.7303418Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.7303886Z module_map=module_map) 2025-05-07T20:32:02.7304258Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.7304708Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.7304970Z E ^ 2025-05-07T20:32:02.7305440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.7305903Z 2025-05-07T20:32:02.7306326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.7306839Z 2025-05-07T20:32:02.7306958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.7307370Z self=, 2025-05-07T20:32:02.7307785Z T=16384, 2025-05-07T20:32:02.7307992Z D=7168, 2025-05-07T20:32:02.7308237Z scale_ub=1200.0, 2025-05-07T20:32:02.7308508Z contiguous=False, 2025-05-07T20:32:02.7308747Z compiled=True, 2025-05-07T20:32:03.0933702Z ) 2025-05-07T20:32:03.0934188Z self = 2025-05-07T20:32:03.0934916Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:03.0935324Z 2025-05-07T20:32:03.0935436Z @given( 2025-05-07T20:32:03.0935760Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.0936195Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.0936630Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.0936962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.0937294Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.0937574Z ) 2025-05-07T20:32:03.0937925Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.0938373Z def test_silu_mul_quant( 2025-05-07T20:32:03.0938620Z self, 2025-05-07T20:32:03.0938815Z T: int, 2025-05-07T20:32:03.0939018Z D: int, 2025-05-07T20:32:03.0939242Z scale_ub: Optional[float], 2025-05-07T20:32:03.0939505Z contiguous: bool, 2025-05-07T20:32:03.0939751Z compiled: bool, 2025-05-07T20:32:03.0939975Z ) -> None: 2025-05-07T20:32:03.0940190Z torch.manual_seed(2025) 2025-05-07T20:32:03.0940438Z 2025-05-07T20:32:03.0940712Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.0941051Z 2025-05-07T20:32:03.0941256Z x_sign = torch.sign(x) 2025-05-07T20:32:03.0941552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.0941865Z x = x_sign * x_clamp 2025-05-07T20:32:03.0942149Z x0 = x[:, :D] 2025-05-07T20:32:03.0942378Z x1 = x[:, D:] 2025-05-07T20:32:03.0942576Z 2025-05-07T20:32:03.0942764Z if contiguous: 2025-05-07T20:32:03.0943008Z x0 = x0.contiguous() 2025-05-07T20:32:03.0943269Z x1 = x1.contiguous() 2025-05-07T20:32:03.0943500Z 2025-05-07T20:32:03.0943691Z if scale_ub is not None: 2025-05-07T20:32:03.0943965Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.0944296Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.0944609Z ) 2025-05-07T20:32:03.0944802Z else: 2025-05-07T20:32:03.0945004Z scale_ub_tensor = None 2025-05-07T20:32:03.0945251Z 2025-05-07T20:32:03.0945479Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.0945784Z op = silu_mul_quant 2025-05-07T20:32:03.0946032Z if compiled: 2025-05-07T20:32:03.0946278Z op = torch.compile(op) 2025-05-07T20:32:03.0946566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.0946837Z 2025-05-07T20:32:03.0947031Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.0947192Z 2025-05-07T20:32:03.0947294Z moe/activation_test.py:117: 2025-05-07T20:32:03.0947585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.0947915Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.0948192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.0948954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:03.0949609Z return fn(*args, **kwargs) 
2025-05-07T20:32:03.0950263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.0950937Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.0951468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.0952190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.0952846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.0953491Z kernel = self.compile( 2025-05-07T20:32:03.0954035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.0954695Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.0955092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.0955320Z 2025-05-07T20:32:03.0955524Z self = 2025-05-07T20:32:03.0956602Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.0957961Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899408900>} 2025-05-07T20:32:03.0959301Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.0960313Z context = 2025-05-07T20:32:03.0960602Z 2025-05-07T20:32:03.0960765Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.0961278Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.0961768Z module_map=module_map) 2025-05-07T20:32:03.0962149Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.0962498Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.0962757Z E ^ 2025-05-07T20:32:03.0963211Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.0963668Z 2025-05-07T20:32:03.0964081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.0964589Z 2025-05-07T20:32:03.0964698Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.0965116Z self=, 2025-05-07T20:32:03.0965647Z T=1, 2025-05-07T20:32:03.0965830Z D=7168, 2025-05-07T20:32:03.0966031Z scale_ub=None, 2025-05-07T20:32:03.0966261Z contiguous=False, 2025-05-07T20:32:03.0966484Z compiled=False, 2025-05-07T20:32:03.0966695Z ) 2025-05-07T20:32:03.0967018Z self = 2025-05-07T20:32:03.0967499Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:03.0967764Z 2025-05-07T20:32:03.0967842Z @given( 2025-05-07T20:32:03.0968078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.0968397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.0968698Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.0969037Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.0969476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.0969755Z ) 2025-05-07T20:32:03.0970102Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.0970545Z def test_silu_mul_quant( 2025-05-07T20:32:03.0970782Z self, 2025-05-07T20:32:03.0970979Z T: int, 2025-05-07T20:32:03.0971175Z D: int, 2025-05-07T20:32:03.0971389Z scale_ub: Optional[float], 2025-05-07T20:32:03.0971658Z contiguous: bool, 2025-05-07T20:32:03.0971922Z compiled: bool, 2025-05-07T20:32:03.0972167Z ) -> None: 2025-05-07T20:32:03.0972375Z torch.manual_seed(2025) 2025-05-07T20:32:03.0972616Z 2025-05-07T20:32:03.0972930Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.0973301Z 2025-05-07T20:32:03.0973503Z x_sign = torch.sign(x) 2025-05-07T20:32:03.0973792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.0974091Z x = x_sign * x_clamp 2025-05-07T20:32:03.0974334Z x0 = x[:, :D] 2025-05-07T20:32:03.0974553Z x1 = x[:, D:] 2025-05-07T20:32:03.0974753Z 2025-05-07T20:32:03.0974939Z if contiguous: 2025-05-07T20:32:03.0975170Z x0 = x0.contiguous() 2025-05-07T20:32:03.0975421Z x1 = x1.contiguous() 2025-05-07T20:32:03.0975657Z 2025-05-07T20:32:03.0975847Z if scale_ub is not None: 2025-05-07T20:32:03.0976194Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.0976621Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.0976931Z ) 2025-05-07T20:32:03.0977117Z else: 2025-05-07T20:32:03.0977328Z scale_ub_tensor = None 2025-05-07T20:32:03.0977582Z 2025-05-07T20:32:03.0977816Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.0978128Z op = silu_mul_quant 2025-05-07T20:32:03.0978377Z if compiled: 2025-05-07T20:32:03.0978626Z op = torch.compile(op) 2025-05-07T20:32:03.0978923Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.0979200Z 2025-05-07T20:32:03.0979399Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.0979562Z 2025-05-07T20:32:03.0979658Z moe/activation_test.py:117: 2025-05-07T20:32:03.0979958Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.0980290Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.0980566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.0981253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.0981946Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.0982477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.0983147Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.0983811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.0984341Z kernel = self.compile( 2025-05-07T20:32:03.0984876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.0985521Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.0985921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.0986147Z 2025-05-07T20:32:03.0986355Z self = 2025-05-07T20:32:03.0987422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.0988879Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899409760>} 2025-05-07T20:32:03.0990309Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.0991320Z context = 2025-05-07T20:32:03.0991621Z 2025-05-07T20:32:03.0991824Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.0992343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.0992856Z module_map=module_map) 2025-05-07T20:32:03.0993260Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.0993602Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.0993865Z E ^ 2025-05-07T20:32:03.0994334Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.0994778Z 2025-05-07T20:32:03.0995195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.0995700Z 2025-05-07T20:32:03.0995802Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.0996209Z self=, 2025-05-07T20:32:03.0996609Z T=2048, 2025-05-07T20:32:03.0996797Z D=7168, 2025-05-07T20:32:03.0996982Z scale_ub=None, 2025-05-07T20:32:03.0997200Z contiguous=False, 2025-05-07T20:32:03.0997423Z compiled=True, 2025-05-07T20:32:03.0997624Z ) 2025-05-07T20:32:03.1685064Z self = 2025-05-07T20:32:03.1685732Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:03.1686034Z 2025-05-07T20:32:03.1686122Z @given( 2025-05-07T20:32:03.1686361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.1686676Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.1686984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.1687318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.1687641Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.1687931Z ) 2025-05-07T20:32:03.1688280Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.1688716Z def test_silu_mul_quant( 2025-05-07T20:32:03.1688966Z self, 2025-05-07T20:32:03.1689166Z T: int, 2025-05-07T20:32:03.1689367Z D: int, 2025-05-07T20:32:03.1689595Z scale_ub: Optional[float], 2025-05-07T20:32:03.1689873Z contiguous: bool, 2025-05-07T20:32:03.1690113Z compiled: bool, 2025-05-07T20:32:03.1690345Z ) -> None: 2025-05-07T20:32:03.1690568Z torch.manual_seed(2025) 2025-05-07T20:32:03.1690810Z 2025-05-07T20:32:03.1691089Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.1691436Z 2025-05-07T20:32:03.1691635Z x_sign = torch.sign(x) 2025-05-07T20:32:03.1691921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.1692237Z x = x_sign * x_clamp 2025-05-07T20:32:03.1692479Z x0 = x[:, :D] 2025-05-07T20:32:03.1692692Z x1 = x[:, D:] 2025-05-07T20:32:03.1692913Z 2025-05-07T20:32:03.1693181Z if contiguous: 2025-05-07T20:32:03.1693471Z x0 = x0.contiguous() 2025-05-07T20:32:03.1693804Z x1 = x1.contiguous() 2025-05-07T20:32:03.1694063Z 2025-05-07T20:32:03.1694259Z if scale_ub is not None: 2025-05-07T20:32:03.1694538Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.1694878Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.1695180Z ) 2025-05-07T20:32:03.1695380Z else: 2025-05-07T20:32:03.1695768Z scale_ub_tensor = None 2025-05-07T20:32:03.1696018Z 2025-05-07T20:32:03.1696250Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.1696563Z op = silu_mul_quant 2025-05-07T20:32:03.1696809Z if compiled: 2025-05-07T20:32:03.1697059Z op = torch.compile(op) 2025-05-07T20:32:03.1697357Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1697634Z 2025-05-07T20:32:03.1697823Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.1697990Z 2025-05-07T20:32:03.1698089Z moe/activation_test.py:117: 2025-05-07T20:32:03.1698380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1698822Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.1699107Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1699668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:03.1700235Z return fn(*args, **kwargs) 
2025-05-07T20:32:03.1700892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.1701584Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.1702165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.1702839Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.1703501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.1704035Z kernel = self.compile( 2025-05-07T20:32:03.1704584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.1705242Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.1705648Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1705874Z 2025-05-07T20:32:03.1706089Z self = 2025-05-07T20:32:03.1707167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.1708525Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489940aa20>} 2025-05-07T20:32:03.1709924Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.1710953Z context = 2025-05-07T20:32:03.1711238Z 2025-05-07T20:32:03.1711417Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.1711930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.1712398Z module_map=module_map) 2025-05-07T20:32:03.1712775Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.1713126Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.1713381Z E ^ 2025-05-07T20:32:03.1713847Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.1714291Z 2025-05-07T20:32:03.1714713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.1715220Z 2025-05-07T20:32:03.1715323Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.1715734Z self=, 2025-05-07T20:32:03.1716247Z T=4096, 2025-05-07T20:32:03.1716438Z D=7168, 2025-05-07T20:32:03.1716628Z scale_ub=None, 2025-05-07T20:32:03.1716848Z contiguous=False, 2025-05-07T20:32:03.1717077Z compiled=True, 2025-05-07T20:32:03.1717277Z ) 2025-05-07T20:32:03.1717598Z self = 2025-05-07T20:32:03.1718088Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:03.1718358Z 2025-05-07T20:32:03.1718437Z @given( 2025-05-07T20:32:03.1718670Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.1718984Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.1719328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.1719700Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.1720032Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.1720313Z ) 2025-05-07T20:32:03.1720661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.1721105Z def test_silu_mul_quant( 2025-05-07T20:32:03.1721347Z self, 2025-05-07T20:32:03.1721539Z T: int, 2025-05-07T20:32:03.1721780Z D: int, 2025-05-07T20:32:03.1722025Z scale_ub: Optional[float], 2025-05-07T20:32:03.1722297Z contiguous: bool, 2025-05-07T20:32:03.1722532Z compiled: bool, 2025-05-07T20:32:03.1722753Z ) -> None: 2025-05-07T20:32:03.1729135Z torch.manual_seed(2025) 2025-05-07T20:32:03.1729400Z 2025-05-07T20:32:03.1729690Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.1730030Z 2025-05-07T20:32:03.1730240Z x_sign = torch.sign(x) 2025-05-07T20:32:03.1730541Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.1730848Z x = x_sign * x_clamp 2025-05-07T20:32:03.1731095Z x0 = x[:, :D] 2025-05-07T20:32:03.1731318Z x1 = x[:, D:] 2025-05-07T20:32:03.1731527Z 2025-05-07T20:32:03.1731723Z if contiguous: 2025-05-07T20:32:03.1731955Z x0 = x0.contiguous() 2025-05-07T20:32:03.1732211Z x1 = x1.contiguous() 2025-05-07T20:32:03.1732455Z 2025-05-07T20:32:03.1732651Z if scale_ub is not None: 2025-05-07T20:32:03.1732917Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.1733261Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.1733572Z ) 2025-05-07T20:32:03.1733764Z else: 2025-05-07T20:32:03.1733984Z scale_ub_tensor = None 2025-05-07T20:32:03.1734237Z 2025-05-07T20:32:03.1734469Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.1734783Z op = silu_mul_quant 2025-05-07T20:32:03.1735039Z if compiled: 2025-05-07T20:32:03.1735285Z op = torch.compile(op) 2025-05-07T20:32:03.1735579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1735858Z 2025-05-07T20:32:03.1736060Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.1736221Z 2025-05-07T20:32:03.1736321Z moe/activation_test.py:117: 2025-05-07T20:32:03.1736618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1736948Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.1737222Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1737785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:03.1738348Z return fn(*args, **kwargs) 
2025-05-07T20:32:03.1739007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:03.1739695Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:03.1740229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:03.1741062Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:03.1741719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:03.1742253Z     kernel = self.compile(
2025-05-07T20:32:03.1742792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:03.1743449Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:03.1743839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:03.1744068Z 
2025-05-07T20:32:03.1744275Z self = 
2025-05-07T20:32:03.1745408Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:03.1746841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489940bce0>}
2025-05-07T20:32:03.1748183Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:03.1749271Z context = 
2025-05-07T20:32:03.1749566Z 
2025-05-07T20:32:03.1749734Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:03.1750262Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:03.1750729Z                            module_map=module_map)
2025-05-07T20:32:03.1751093Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:03.1751449Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:03.1751714Z E       ^
2025-05-07T20:32:03.1752228Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.1753101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:03.3010364Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:03.3010808Z     self=,
2025-05-07T20:32:03.3011324Z     T=16384,
2025-05-07T20:32:03.3011683Z     D=5120,
2025-05-07T20:32:03.3012015Z     scale_ub=1200.0,
2025-05-07T20:32:03.3012547Z     contiguous=False,
2025-05-07T20:32:03.3013232Z     compiled=False,
2025-05-07T20:32:03.3013850Z )
2025-05-07T20:32:03.3014470Z self = 
2025-05-07T20:32:03.3015388Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:03.3015968Z 
2025-05-07T20:32:03.3016118Z     @given(
2025-05-07T20:32:03.3016542Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:03.3017108Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:03.3017669Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:03.3018278Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:03.3018881Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:03.3019400Z     )
2025-05-07T20:32:03.3020038Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:03.3020835Z     def test_silu_mul_quant(
2025-05-07T20:32:03.3021295Z         self,
2025-05-07T20:32:03.3021663Z         T: int,
2025-05-07T20:32:03.3022045Z         D: int,
2025-05-07T20:32:03.3022449Z         scale_ub: Optional[float],
2025-05-07T20:32:03.3022960Z         contiguous: bool,
2025-05-07T20:32:03.3023271Z         compiled: bool,
2025-05-07T20:32:03.3023494Z     ) -> None:
2025-05-07T20:32:03.3023886Z         torch.manual_seed(2025)
2025-05-07T20:32:03.3024148Z 
2025-05-07T20:32:03.3024416Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:03.3024767Z 
2025-05-07T20:32:03.3024968Z         x_sign = torch.sign(x)
2025-05-07T20:32:03.3025258Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:03.3025565Z         x = x_sign * x_clamp
2025-05-07T20:32:03.3025807Z         x0 = x[:, :D]
2025-05-07T20:32:03.3026025Z         x1 = x[:, D:]
2025-05-07T20:32:03.3026233Z 
2025-05-07T20:32:03.3026426Z         if contiguous:
2025-05-07T20:32:03.3026664Z             x0 = x0.contiguous()
2025-05-07T20:32:03.3026986Z             x1 = x1.contiguous()
2025-05-07T20:32:03.3027286Z 
2025-05-07T20:32:03.3027479Z         if scale_ub is not None:
2025-05-07T20:32:03.3027755Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:03.3028095Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:03.3028679Z             )
2025-05-07T20:32:03.3028868Z         else:
2025-05-07T20:32:03.3029137Z             scale_ub_tensor = None
2025-05-07T20:32:03.3029389Z 
2025-05-07T20:32:03.3029617Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:03.3029931Z             op = silu_mul_quant
2025-05-07T20:32:03.3030180Z             if compiled:
2025-05-07T20:32:03.3030421Z                 op = torch.compile(op)
2025-05-07T20:32:03.3030715Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:03.3030983Z 
2025-05-07T20:32:03.3031178Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:03.3031341Z 
2025-05-07T20:32:03.3031452Z moe/activation_test.py:117: 
2025-05-07T20:32:03.3031773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:03.3032134Z moe/activation_test.py:115: in fn
2025-05-07T20:32:03.3032412Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:03.3033097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:03.3033786Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:03.3034318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:03.3034999Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:03.3035650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:03.3036187Z     kernel = self.compile(
2025-05-07T20:32:03.3036724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:03.3037375Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:03.3037770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:03.3038009Z 
2025-05-07T20:32:03.3038221Z self = 
2025-05-07T20:32:03.3039298Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:03.3040653Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898b24c20>}
2025-05-07T20:32:03.3041988Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:03.3043020Z context = 
2025-05-07T20:32:03.3043310Z 
2025-05-07T20:32:03.3043475Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:03.3044132Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:03.3044597Z                            module_map=module_map)
2025-05-07T20:32:03.3044961Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:03.3045313Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:03.3045565Z E       ^
2025-05-07T20:32:03.3046027Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.3046893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
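The compile-time failure above is identical for every Hypothesis example that follows: Triton rejects the fp8e4nv (FP8 E4M3) dtype while lowering _fbgemm_silu_mul_quant and reports that only 'fp8e4b15' and 'fp8e5' are available. A plausible reading, stated here as an assumption rather than a fact from the log: the g5.4xlarge runner carries an NVIDIA A10G, an Ampere-class GPU (compute capability 8.6), while Triton only lowers fp8e4nv on newer architectures (roughly SM 8.9 and up). Below is a minimal sketch of a capability guard that would let such tests skip cleanly on this hardware; the helper name and the (8, 9) threshold are assumptions to verify against the Triton version in use, not part of the FBGEMM test suite.

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv only on compute capability >= 8.9
        # (Ada/Hopper). The A10G in a g5.4xlarge reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical wrapper showing where the guard would attach; the real
    # test class in moe/activation_test.py is not reproduced here.
    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class Fp8ActivationTestsGuarded(unittest.TestCase):
        ...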
2025-05-07T20:32:03.3047651Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:03.3078654Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.4385557Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:03.4419597Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.4421090Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:03.6947721Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.6949284Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:03.6979917Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.8298031Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:03.8331312Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.8332831Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:03.9190150Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.9191632Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:03.9221892Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:04.0178425Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:04.0216560Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:04.0218142Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:04.0247491Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:04.3397858Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:04.3429311Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:04.3430791Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:04.4182460Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.4182987Z 2025-05-07T20:32:04.4183403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.4183908Z 2025-05-07T20:32:04.4184012Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.4184420Z self=, 2025-05-07T20:32:04.4184821Z T=2048, 2025-05-07T20:32:04.4185006Z D=7168, 2025-05-07T20:32:04.4185199Z scale_ub=None, 2025-05-07T20:32:04.4185434Z contiguous=True, 2025-05-07T20:32:04.4185655Z compiled=True, 2025-05-07T20:32:04.4185855Z ) 2025-05-07T20:32:04.4186171Z self = 2025-05-07T20:32:04.4186735Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:04.4186995Z 2025-05-07T20:32:04.4187078Z @given( 2025-05-07T20:32:04.4187302Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.4187616Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.4187919Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.4188240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.4188562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.4188844Z ) 2025-05-07T20:32:04.4189259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.4189698Z def test_silu_mul_quant( 2025-05-07T20:32:04.4189939Z self, 2025-05-07T20:32:04.4190129Z T: int, 2025-05-07T20:32:04.4190324Z D: int, 2025-05-07T20:32:04.4190541Z scale_ub: Optional[float], 2025-05-07T20:32:04.4190808Z contiguous: bool, 2025-05-07T20:32:04.4191049Z compiled: bool, 2025-05-07T20:32:04.4191270Z ) -> None: 2025-05-07T20:32:04.4191485Z torch.manual_seed(2025) 2025-05-07T20:32:04.4191718Z 2025-05-07T20:32:04.4191991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.4192340Z 2025-05-07T20:32:04.4192567Z x_sign = torch.sign(x) 2025-05-07T20:32:04.4192863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.4193166Z x = x_sign * x_clamp 2025-05-07T20:32:04.4193405Z x0 = x[:, :D] 2025-05-07T20:32:04.4193617Z x1 = x[:, D:] 2025-05-07T20:32:04.4193820Z 2025-05-07T20:32:04.4193998Z if contiguous: 2025-05-07T20:32:04.4194232Z x0 = x0.contiguous() 2025-05-07T20:32:04.4194488Z x1 = x1.contiguous() 2025-05-07T20:32:04.4194721Z 2025-05-07T20:32:04.4194916Z if scale_ub is not None: 2025-05-07T20:32:04.4195188Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.4195521Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.4195824Z ) 2025-05-07T20:32:04.4196017Z else: 2025-05-07T20:32:04.4196225Z scale_ub_tensor = None 2025-05-07T20:32:04.4196477Z 2025-05-07T20:32:04.4196721Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.4197043Z op = silu_mul_quant 2025-05-07T20:32:04.4197288Z if compiled: 2025-05-07T20:32:04.4197535Z op = torch.compile(op) 2025-05-07T20:32:04.4197836Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.4198105Z 2025-05-07T20:32:04.4198297Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.4198458Z 2025-05-07T20:32:04.4198560Z moe/activation_test.py:117: 2025-05-07T20:32:04.4198861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.4199200Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.4205165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.4205748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:04.4206309Z return fn(*args, **kwargs) 
2025-05-07T20:32:04.4207104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.4207804Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.4208344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.4209025Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.4209703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.4210236Z kernel = self.compile( 2025-05-07T20:32:04.4210779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.4211525Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.4211943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.4212214Z 2025-05-07T20:32:04.4212435Z self = 2025-05-07T20:32:04.4213523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.4214905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48986f1940>} 2025-05-07T20:32:04.4216264Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.4217306Z context = 2025-05-07T20:32:04.4217595Z 2025-05-07T20:32:04.4217774Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.4218291Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.4218762Z module_map=module_map) 2025-05-07T20:32:04.4219137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.4219484Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.4219739Z E ^ 2025-05-07T20:32:04.4220204Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.4220656Z 2025-05-07T20:32:04.4221083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.4221603Z 2025-05-07T20:32:04.4891977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.4892589Z self=, 2025-05-07T20:32:04.4893208Z T=16384, 2025-05-07T20:32:04.4893493Z D=5120, 2025-05-07T20:32:04.4893769Z scale_ub=None, 2025-05-07T20:32:04.4894057Z contiguous=False, 2025-05-07T20:32:04.4894373Z compiled=False, 2025-05-07T20:32:04.4894642Z ) 2025-05-07T20:32:04.4895075Z self = 2025-05-07T20:32:04.4895672Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:04.4895949Z 2025-05-07T20:32:04.4896030Z @given( 2025-05-07T20:32:04.4896257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.4896564Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.4896869Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.4897193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.4897519Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.4897803Z ) 2025-05-07T20:32:04.4898144Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.4898755Z def test_silu_mul_quant( 2025-05-07T20:32:04.4898998Z self, 2025-05-07T20:32:04.4899186Z T: int, 2025-05-07T20:32:04.4899389Z D: int, 2025-05-07T20:32:04.4899605Z scale_ub: Optional[float], 2025-05-07T20:32:04.4899866Z contiguous: bool, 2025-05-07T20:32:04.4900105Z compiled: bool, 2025-05-07T20:32:04.4900327Z ) -> None: 2025-05-07T20:32:04.4900541Z torch.manual_seed(2025) 2025-05-07T20:32:04.4900778Z 2025-05-07T20:32:04.4901049Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.4901387Z 2025-05-07T20:32:04.4901584Z x_sign = torch.sign(x) 2025-05-07T20:32:04.4901873Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.4904013Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
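Every CompilationError in this run has the same root cause: the job runs on a g5 (A10G) instance, which is compute capability sm_86, while Triton's fp8e4nv (e4m3) type is only supported on sm_89 and newer; on this architecture only fp8e4b15 and fp8e5 are available, exactly as the ValueError lists. A minimal sketch of a capability guard, assuming only public torch APIs (the 8.9 threshold is inferred from the error above, not taken from the test code):

```python
# Sketch, not part of activation_test.py: gate fp8e4nv usage on compute
# capability. The (8, 9) threshold is an assumption inferred from this log:
# the A10G (sm_86) rejects fp8e4nv, while Ada (sm_89) and Hopper (sm_90)
# accept it.
import torch

def device_supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)
```

Skipping the test (or falling back to a supported fp8 dtype) on this predicate would turn these hard compile failures into skips on pre-Ada runners.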
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.4905892Z 2025-05-07T20:32:04.4906012Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:04.4906220Z 2025-05-07T20:32:04.4906325Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.4906722Z self=, 2025-05-07T20:32:04.4907115Z T=4096, 2025-05-07T20:32:04.4907312Z D=7168, 2025-05-07T20:32:04.4907507Z scale_ub=1200.0, 2025-05-07T20:32:04.4907735Z contiguous=True, 2025-05-07T20:32:04.4907956Z compiled=True, 2025-05-07T20:32:04.4908154Z ) 2025-05-07T20:32:04.4908471Z self = 2025-05-07T20:32:04.4908968Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:04.4909325Z 2025-05-07T20:32:04.4909409Z @given( 2025-05-07T20:32:04.4909630Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.4909936Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.4910244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.4910565Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.4910887Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.4911175Z ) 2025-05-07T20:32:04.4911523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.4911962Z def test_silu_mul_quant( 2025-05-07T20:32:04.4912238Z self, 2025-05-07T20:32:04.4912451Z T: int, 2025-05-07T20:32:04.4912649Z D: int, 2025-05-07T20:32:04.4912870Z scale_ub: Optional[float], 2025-05-07T20:32:04.4913142Z contiguous: bool, 2025-05-07T20:32:04.4913385Z compiled: bool, 2025-05-07T20:32:04.4913609Z ) -> None: 2025-05-07T20:32:04.4913828Z torch.manual_seed(2025) 2025-05-07T20:32:04.4914062Z 2025-05-07T20:32:04.4914337Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.4914674Z 2025-05-07T20:32:04.4914865Z x_sign = torch.sign(x) 2025-05-07T20:32:04.4915159Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.4917235Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.4919088Z 2025-05-07T20:32:04.4919207Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:04.4919418Z 2025-05-07T20:32:04.4919525Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.4919931Z self=, 2025-05-07T20:32:04.4920328Z T=16384, 2025-05-07T20:32:04.4920528Z D=7168, 2025-05-07T20:32:04.4920721Z scale_ub=None, 2025-05-07T20:32:04.4920940Z contiguous=False, 2025-05-07T20:32:04.4921168Z compiled=False, 2025-05-07T20:32:04.4921367Z ) 2025-05-07T20:32:04.4921688Z self = 2025-05-07T20:32:04.4922246Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:04.4922601Z 2025-05-07T20:32:04.4922687Z @given( 2025-05-07T20:32:04.4922916Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.4923223Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.4923533Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.4923858Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.4924182Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.4924467Z ) 2025-05-07T20:32:04.4924812Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.4925245Z def test_silu_mul_quant( 2025-05-07T20:32:04.4925485Z self, 2025-05-07T20:32:04.4925675Z T: int, 2025-05-07T20:32:04.4925875Z D: int, 2025-05-07T20:32:04.4926090Z scale_ub: Optional[float], 2025-05-07T20:32:04.4926359Z contiguous: bool, 2025-05-07T20:32:04.4926596Z compiled: bool, 2025-05-07T20:32:04.4926816Z ) -> None: 2025-05-07T20:32:04.4927024Z torch.manual_seed(2025) 2025-05-07T20:32:04.4927263Z 2025-05-07T20:32:04.4927533Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.4929847Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
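The allocation sizes in these OOM messages line up exactly with the test's input tensor: x has shape [T, 2 * D] in bfloat16, i.e. 2 bytes per element. For the T=16384, D=7168 example above, that is a single 448 MiB request, which is precisely what the allocator reports:

```python
# Cross-checking the reported allocation against the test's tensor shape:
# x = torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D * 2 bytes.
T, D = 16384, 7168
size_bytes = T * (2 * D) * 2       # bfloat16 is 2 bytes per element
print(size_bytes / (1024 ** 2))    # 448.0 -> the 448.00 MiB in the OOM above
```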
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.4931716Z 2025-05-07T20:32:04.4931836Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.4932054Z 2025-05-07T20:32:04.4932158Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.4932562Z self=, 2025-05-07T20:32:04.4932961Z T=2048, 2025-05-07T20:32:04.4933149Z D=7168, 2025-05-07T20:32:04.4933340Z scale_ub=1200.0, 2025-05-07T20:32:04.4933567Z contiguous=True, 2025-05-07T20:32:04.4933782Z compiled=True, 2025-05-07T20:32:04.4933987Z ) 2025-05-07T20:32:04.4934307Z self = 2025-05-07T20:32:04.4934790Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:04.4935061Z 2025-05-07T20:32:04.4935140Z @given( 2025-05-07T20:32:04.4935365Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.4935678Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.4935978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.4936301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.4936631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.4936905Z ) 2025-05-07T20:32:04.4937250Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.4937686Z def test_silu_mul_quant( 2025-05-07T20:32:04.4938059Z self, 2025-05-07T20:32:04.4938250Z T: int, 2025-05-07T20:32:04.4938445Z D: int, 2025-05-07T20:32:04.4938655Z scale_ub: Optional[float], 2025-05-07T20:32:04.4938925Z contiguous: bool, 2025-05-07T20:32:04.4939161Z compiled: bool, 2025-05-07T20:32:04.4939376Z ) -> None: 2025-05-07T20:32:04.4939589Z torch.manual_seed(2025) 2025-05-07T20:32:04.4939825Z 2025-05-07T20:32:04.4940090Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.4940430Z 2025-05-07T20:32:04.4940622Z x_sign = torch.sign(x) 2025-05-07T20:32:04.4940907Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.4943032Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.4944948Z 2025-05-07T20:32:04.4945070Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:04.4945292Z 2025-05-07T20:32:04.4945397Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.4945809Z self=, 2025-05-07T20:32:04.4946203Z T=2048, 2025-05-07T20:32:04.4946389Z D=7168, 2025-05-07T20:32:04.4946580Z scale_ub=None, 2025-05-07T20:32:04.4946794Z contiguous=True, 2025-05-07T20:32:04.4947013Z compiled=False, 2025-05-07T20:32:04.4947217Z ) 2025-05-07T20:32:04.5814594Z self = 2025-05-07T20:32:04.5815299Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.5815688Z 2025-05-07T20:32:04.5815826Z @given( 2025-05-07T20:32:04.5816139Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.5816570Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.5816873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.5817202Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.5817526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.5817801Z ) 2025-05-07T20:32:04.5818146Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.5818579Z def test_silu_mul_quant( 2025-05-07T20:32:04.5818820Z self, 2025-05-07T20:32:04.5819012Z T: int, 2025-05-07T20:32:04.5819211Z D: int, 2025-05-07T20:32:04.5819431Z scale_ub: Optional[float], 2025-05-07T20:32:04.5819696Z contiguous: bool, 2025-05-07T20:32:04.5819934Z compiled: bool, 2025-05-07T20:32:04.5820160Z ) -> None: 2025-05-07T20:32:04.5820380Z torch.manual_seed(2025) 2025-05-07T20:32:04.5820613Z 2025-05-07T20:32:04.5820879Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.5821213Z 2025-05-07T20:32:04.5821409Z > x_sign = torch.sign(x) 2025-05-07T20:32:04.5823714Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
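Note that the failing requests here are small (40 to 448 MiB) while roughly 21.7 GiB of the A10G's 22.07 GiB is already held by PyTorch, so the problem is accumulation across Hypothesis examples rather than any single tensor. A sketch of one mitigation, assuming a hook can run between examples (this is not the suite's actual fixture; PYTORCH_CUDA_ALLOC_CONF is the allocator hint quoted in the messages themselves):

```python
# Sketch, not the suite's actual teardown: release cached CUDA blocks
# between Hypothesis examples so earlier inputs do not starve later ones.
import gc
import os

# The allocator hint from the OOM messages; it must be set before the
# first CUDA allocation in the process to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cuda_memory() -> None:
    gc.collect()                # drop dead Python references first
    torch.cuda.empty_cache()    # return cached blocks to the CUDA driver
    torch.cuda.synchronize()    # ensure pending frees have completed
```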
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.5825640Z 2025-05-07T20:32:04.5825948Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:04.5826165Z 2025-05-07T20:32:04.5826266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.5826675Z self=, 2025-05-07T20:32:04.5827068Z T=1, 2025-05-07T20:32:04.5827256Z D=7168, 2025-05-07T20:32:04.5827445Z scale_ub=1200.0, 2025-05-07T20:32:04.5827658Z contiguous=True, 2025-05-07T20:32:04.5827880Z compiled=False, 2025-05-07T20:32:04.5828081Z ) 2025-05-07T20:32:04.5828594Z self = 2025-05-07T20:32:04.5829115Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:04.5829445Z 2025-05-07T20:32:04.5829529Z @given( 2025-05-07T20:32:04.5829815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.5830125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.5830428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.5830763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.5831081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.5831361Z ) 2025-05-07T20:32:04.5831754Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.5832290Z def test_silu_mul_quant( 2025-05-07T20:32:04.5832587Z self, 2025-05-07T20:32:04.5832827Z T: int, 2025-05-07T20:32:04.5833064Z D: int, 2025-05-07T20:32:04.5833336Z scale_ub: Optional[float], 2025-05-07T20:32:04.5833670Z contiguous: bool, 2025-05-07T20:32:04.5833918Z compiled: bool, 2025-05-07T20:32:04.5834134Z ) -> None: 2025-05-07T20:32:04.5834349Z torch.manual_seed(2025) 2025-05-07T20:32:04.5834588Z 2025-05-07T20:32:04.5834861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.5835195Z 2025-05-07T20:32:04.5835391Z x_sign = torch.sign(x) 2025-05-07T20:32:04.5835678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.5835982Z x = x_sign * x_clamp 2025-05-07T20:32:04.5836222Z x0 = x[:, :D] 2025-05-07T20:32:04.5836431Z x1 = x[:, D:] 2025-05-07T20:32:04.5836632Z 2025-05-07T20:32:04.5836814Z if contiguous: 2025-05-07T20:32:04.5837039Z x0 = x0.contiguous() 2025-05-07T20:32:04.5837298Z x1 = x1.contiguous() 2025-05-07T20:32:04.5837533Z 2025-05-07T20:32:04.5837717Z if scale_ub is not None: 2025-05-07T20:32:04.5837982Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.5838311Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.5838609Z ) 2025-05-07T20:32:04.5838802Z else: 2025-05-07T20:32:04.5839012Z scale_ub_tensor = None 2025-05-07T20:32:04.5839256Z 2025-05-07T20:32:04.5839484Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.5839794Z op = silu_mul_quant 2025-05-07T20:32:04.5840040Z if compiled: 2025-05-07T20:32:04.5840284Z op = torch.compile(op) 2025-05-07T20:32:04.5840588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.5840855Z 2025-05-07T20:32:04.5841041Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.5841208Z 2025-05-07T20:32:04.5841305Z moe/activation_test.py:117: 2025-05-07T20:32:04.5841602Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.5841926Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.5842206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.5842895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.5843584Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.5844108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.5844926Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.5845587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.5846108Z kernel = self.compile( 2025-05-07T20:32:04.5846641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.5847294Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.5847686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.5847911Z 2025-05-07T20:32:04.5848115Z self = 2025-05-07T20:32:04.5849267Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.5850622Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898219300>} 2025-05-07T20:32:04.5852024Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.5853280Z context = 2025-05-07T20:32:04.5853635Z 2025-05-07T20:32:04.5853840Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.5854369Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.5854837Z module_map=module_map) 2025-05-07T20:32:04.5855193Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.5855545Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.5855805Z E ^ 2025-05-07T20:32:04.5856258Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.5856706Z 2025-05-07T20:32:04.5857117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.5857625Z 2025-05-07T20:32:04.5857727Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.5858132Z self=, 2025-05-07T20:32:04.5858521Z T=128, 2025-05-07T20:32:04.5858705Z D=5120, 2025-05-07T20:32:04.5858895Z scale_ub=None, 2025-05-07T20:32:04.5859108Z contiguous=True, 2025-05-07T20:32:04.5859332Z compiled=False, 2025-05-07T20:32:04.5859544Z ) 2025-05-07T20:32:04.6403809Z self = 2025-05-07T20:32:04.6404498Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.6404884Z 2025-05-07T20:32:04.6404990Z @given( 2025-05-07T20:32:04.6405308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.6405733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.6406122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.6406447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.6406766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.6407044Z ) 2025-05-07T20:32:04.6407389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.6407818Z def test_silu_mul_quant( 2025-05-07T20:32:04.6408059Z self, 2025-05-07T20:32:04.6408247Z T: int, 2025-05-07T20:32:04.6408446Z D: int, 2025-05-07T20:32:04.6408661Z scale_ub: Optional[float], 2025-05-07T20:32:04.6408922Z contiguous: bool, 2025-05-07T20:32:04.6409157Z compiled: bool, 2025-05-07T20:32:04.6409381Z ) -> None: 2025-05-07T20:32:04.6409750Z torch.manual_seed(2025) 2025-05-07T20:32:04.6409992Z 2025-05-07T20:32:04.6410260Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.6410590Z 2025-05-07T20:32:04.6410783Z x_sign = torch.sign(x) 2025-05-07T20:32:04.6411071Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.6411373Z x = x_sign * x_clamp 2025-05-07T20:32:04.6411615Z x0 = x[:, :D] 2025-05-07T20:32:04.6411831Z x1 = x[:, D:] 2025-05-07T20:32:04.6412030Z 2025-05-07T20:32:04.6412215Z if contiguous: 2025-05-07T20:32:04.6412472Z x0 = x0.contiguous() 2025-05-07T20:32:04.6412811Z x1 = x1.contiguous() 2025-05-07T20:32:04.6413095Z 2025-05-07T20:32:04.6413283Z if scale_ub is not None: 2025-05-07T20:32:04.6413551Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.6413876Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.6414185Z ) 2025-05-07T20:32:04.6414378Z else: 2025-05-07T20:32:04.6414584Z scale_ub_tensor = None 2025-05-07T20:32:04.6414832Z 2025-05-07T20:32:04.6415059Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.6415365Z op = silu_mul_quant 2025-05-07T20:32:04.6415614Z if compiled: 2025-05-07T20:32:04.6415857Z op = torch.compile(op) 2025-05-07T20:32:04.6416146Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.6416418Z 2025-05-07T20:32:04.6416611Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.6416773Z 2025-05-07T20:32:04.6416874Z moe/activation_test.py:117: 2025-05-07T20:32:04.6417160Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6417498Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.6417774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.6418460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.6419142Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.6419670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.6420340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.6420990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.6421514Z kernel = self.compile( 2025-05-07T20:32:04.6422047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.6422699Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.6423096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6423324Z 2025-05-07T20:32:04.6423530Z self = 2025-05-07T20:32:04.6424595Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.6425946Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489821a520>} 2025-05-07T20:32:04.6427276Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.6428484Z context = 2025-05-07T20:32:04.6428772Z 2025-05-07T20:32:04.6428940Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.6429610Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.6430069Z module_map=module_map) 2025-05-07T20:32:04.6430429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.6430779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.6431032Z E ^ 2025-05-07T20:32:04.6431489Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.6431934Z 2025-05-07T20:32:04.6432401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.6432964Z 2025-05-07T20:32:04.6433150Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.6433557Z self=, 2025-05-07T20:32:04.6433950Z T=128, 2025-05-07T20:32:04.6434144Z D=7168, 2025-05-07T20:32:04.6434331Z scale_ub=None, 2025-05-07T20:32:04.6434576Z contiguous=True, 2025-05-07T20:32:04.6434798Z compiled=False, 2025-05-07T20:32:04.6435001Z ) 2025-05-07T20:32:04.6435312Z self = 2025-05-07T20:32:04.6435792Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.6436053Z 2025-05-07T20:32:04.6436138Z @given( 2025-05-07T20:32:04.6436364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.6436677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.6436982Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.6437305Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.6437629Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.6437912Z ) 2025-05-07T20:32:04.6438261Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.6438690Z def test_silu_mul_quant( 2025-05-07T20:32:04.6438934Z self, 2025-05-07T20:32:04.6439126Z T: int, 2025-05-07T20:32:04.6439319Z D: int, 2025-05-07T20:32:04.6445658Z scale_ub: Optional[float], 2025-05-07T20:32:04.6445936Z contiguous: bool, 2025-05-07T20:32:04.6446171Z compiled: bool, 2025-05-07T20:32:04.6446389Z ) -> None: 2025-05-07T20:32:04.6446600Z torch.manual_seed(2025) 2025-05-07T20:32:04.6446836Z 2025-05-07T20:32:04.6447106Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.6447445Z 2025-05-07T20:32:04.6447637Z x_sign = torch.sign(x) 2025-05-07T20:32:04.6447920Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.6448236Z x = x_sign * x_clamp 2025-05-07T20:32:04.6448477Z x0 = x[:, :D] 2025-05-07T20:32:04.6448685Z x1 = x[:, D:] 2025-05-07T20:32:04.6448895Z 2025-05-07T20:32:04.6449078Z if contiguous: 2025-05-07T20:32:04.6449300Z x0 = x0.contiguous() 2025-05-07T20:32:04.6449560Z x1 = x1.contiguous() 2025-05-07T20:32:04.6449794Z 2025-05-07T20:32:04.6449978Z if scale_ub is not None: 2025-05-07T20:32:04.6450241Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.6450572Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.6450869Z ) 2025-05-07T20:32:04.6451061Z else: 2025-05-07T20:32:04.6451277Z scale_ub_tensor = None 2025-05-07T20:32:04.6451527Z 2025-05-07T20:32:04.6451753Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.6452082Z op = silu_mul_quant 2025-05-07T20:32:04.6452367Z if compiled: 2025-05-07T20:32:04.6452607Z op = torch.compile(op) 2025-05-07T20:32:04.6452907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.6453176Z 2025-05-07T20:32:04.6453364Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.6453530Z 2025-05-07T20:32:04.6453630Z moe/activation_test.py:117: 2025-05-07T20:32:04.6454034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6454357Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.6454631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.6455316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.6456000Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.6456522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.6457198Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.6457930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.6458447Z kernel = self.compile( 2025-05-07T20:32:04.6458986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.6459632Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.6460026Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6460248Z 2025-05-07T20:32:04.6460451Z self = 2025-05-07T20:32:04.6461518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.6462922Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489821b560>} 2025-05-07T20:32:04.6464260Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.6465272Z context = 2025-05-07T20:32:04.6465555Z 2025-05-07T20:32:04.6465718Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.6466224Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.6466680Z module_map=module_map) 2025-05-07T20:32:04.6467034Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.6467381Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.6467632Z E ^ 2025-05-07T20:32:04.6468091Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.6468535Z 2025-05-07T20:32:04.6468951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.6469531Z 2025-05-07T20:32:04.6469633Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.6470035Z self=, 2025-05-07T20:32:04.6470430Z T=2048, 2025-05-07T20:32:04.6470614Z D=7168, 2025-05-07T20:32:04.6470813Z scale_ub=1200.0, 2025-05-07T20:32:04.6471034Z contiguous=True, 2025-05-07T20:32:04.6471247Z compiled=False, 2025-05-07T20:32:04.6471453Z ) 2025-05-07T20:32:04.7130713Z self = 2025-05-07T20:32:04.7131350Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:04.7131726Z 2025-05-07T20:32:04.7131839Z @given( 2025-05-07T20:32:04.7132207Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7132679Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7133106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7133692Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7134026Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7134301Z ) 2025-05-07T20:32:04.7134648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7135085Z def test_silu_mul_quant( 2025-05-07T20:32:04.7135322Z self, 2025-05-07T20:32:04.7135521Z T: int, 2025-05-07T20:32:04.7135720Z D: int, 2025-05-07T20:32:04.7135938Z scale_ub: Optional[float], 2025-05-07T20:32:04.7136202Z contiguous: bool, 2025-05-07T20:32:04.7136439Z compiled: bool, 2025-05-07T20:32:04.7136664Z ) -> None: 2025-05-07T20:32:04.7136938Z torch.manual_seed(2025) 2025-05-07T20:32:04.7137233Z 2025-05-07T20:32:04.7137504Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7139553Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7141396Z 2025-05-07T20:32:04.7141515Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.7141730Z 2025-05-07T20:32:04.7141833Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7142246Z self=, 2025-05-07T20:32:04.7142647Z T=1, 2025-05-07T20:32:04.7142833Z D=5120, 2025-05-07T20:32:04.7143024Z scale_ub=1200.0, 2025-05-07T20:32:04.7143251Z contiguous=True, 2025-05-07T20:32:04.7143467Z compiled=False, 2025-05-07T20:32:04.7143697Z ) 2025-05-07T20:32:04.7144016Z self = 2025-05-07T20:32:04.7144499Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:04.7144763Z 2025-05-07T20:32:04.7144843Z @given( 2025-05-07T20:32:04.7145073Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7145380Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7145675Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7146004Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7146330Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7146612Z ) 2025-05-07T20:32:04.7146955Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7147390Z def test_silu_mul_quant( 2025-05-07T20:32:04.7147626Z self, 2025-05-07T20:32:04.7147813Z T: int, 2025-05-07T20:32:04.7148007Z D: int, 2025-05-07T20:32:04.7148229Z scale_ub: Optional[float], 2025-05-07T20:32:04.7148501Z contiguous: bool, 2025-05-07T20:32:04.7148739Z compiled: bool, 2025-05-07T20:32:04.7148956Z ) -> None: 2025-05-07T20:32:04.7149241Z torch.manual_seed(2025) 2025-05-07T20:32:04.7149478Z 2025-05-07T20:32:04.7149742Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7150077Z 2025-05-07T20:32:04.7150269Z x_sign = torch.sign(x) 2025-05-07T20:32:04.7150555Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.7150856Z x = x_sign * x_clamp 2025-05-07T20:32:04.7151096Z x0 = x[:, :D] 2025-05-07T20:32:04.7151312Z x1 = x[:, D:] 2025-05-07T20:32:04.7151519Z 2025-05-07T20:32:04.7151736Z if contiguous: 2025-05-07T20:32:04.7152021Z x0 = x0.contiguous() 2025-05-07T20:32:04.7152338Z x1 = x1.contiguous() 2025-05-07T20:32:04.7152639Z 2025-05-07T20:32:04.7152980Z if scale_ub is not None: 2025-05-07T20:32:04.7153319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.7153736Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.7154114Z ) 2025-05-07T20:32:04.7154358Z else: 2025-05-07T20:32:04.7154576Z scale_ub_tensor = None 2025-05-07T20:32:04.7154820Z 2025-05-07T20:32:04.7155045Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.7155353Z op = silu_mul_quant 2025-05-07T20:32:04.7155597Z if compiled: 2025-05-07T20:32:04.7155840Z op = torch.compile(op) 2025-05-07T20:32:04.7156130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.7156480Z 2025-05-07T20:32:04.7156665Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.7156828Z 2025-05-07T20:32:04.7156927Z moe/activation_test.py:117: 2025-05-07T20:32:04.7157216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.7157545Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.7157821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.7158502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.7159182Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.7159707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.7160385Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.7161039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.7161569Z kernel = self.compile( 2025-05-07T20:32:04.7162211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.7163032Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.7163524Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.7163795Z 2025-05-07T20:32:04.7164001Z self = 2025-05-07T20:32:04.7165070Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.7166415Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48980e8a40>} 2025-05-07T20:32:04.7167748Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.7168762Z context = 2025-05-07T20:32:04.7169046Z 2025-05-07T20:32:04.7169212Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.7169728Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.7170193Z module_map=module_map) 2025-05-07T20:32:04.7170558Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.7170913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.7171174Z E ^ 2025-05-07T20:32:04.7171633Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.7172086Z 2025-05-07T20:32:04.7172500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.7173013Z 2025-05-07T20:32:04.7173115Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7173604Z self=, 2025-05-07T20:32:04.7174002Z T=2048, 2025-05-07T20:32:04.7174183Z D=5120, 2025-05-07T20:32:04.7174375Z scale_ub=None, 2025-05-07T20:32:04.7174585Z contiguous=True, 2025-05-07T20:32:04.7174808Z compiled=False, 2025-05-07T20:32:04.7175010Z ) 2025-05-07T20:32:04.7175324Z self = 2025-05-07T20:32:04.7175822Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.7176088Z 2025-05-07T20:32:04.7176164Z @given( 2025-05-07T20:32:04.7176392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7176769Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7177106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7177434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7177758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7178042Z ) 2025-05-07T20:32:04.7178389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7178823Z def test_silu_mul_quant( 2025-05-07T20:32:04.7179058Z self, 2025-05-07T20:32:04.7179243Z T: int, 2025-05-07T20:32:04.7179439Z D: int, 2025-05-07T20:32:04.7179657Z scale_ub: Optional[float], 2025-05-07T20:32:04.7179921Z contiguous: bool, 2025-05-07T20:32:04.7180159Z compiled: bool, 2025-05-07T20:32:04.7180376Z ) -> None: 2025-05-07T20:32:04.7180583Z torch.manual_seed(2025) 2025-05-07T20:32:04.7180819Z 2025-05-07T20:32:04.7181089Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7181429Z 2025-05-07T20:32:04.7181623Z > x_sign = torch.sign(x) 2025-05-07T20:32:04.7183997Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7185912Z 2025-05-07T20:32:04.7186030Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:04.7186246Z 2025-05-07T20:32:04.7186357Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7186759Z self=, 2025-05-07T20:32:04.7187166Z T=16384, 2025-05-07T20:32:04.7187355Z D=5120, 2025-05-07T20:32:04.7187540Z scale_ub=None, 2025-05-07T20:32:04.7187748Z contiguous=True, 2025-05-07T20:32:04.7187970Z compiled=False, 2025-05-07T20:32:04.7188165Z ) 2025-05-07T20:32:04.7906498Z self = 2025-05-07T20:32:04.7907856Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.7908569Z 2025-05-07T20:32:04.7908771Z @given( 2025-05-07T20:32:04.7909458Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7910119Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7910677Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7911268Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7911879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7912493Z ) 2025-05-07T20:32:04.7913259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7914003Z def test_silu_mul_quant( 2025-05-07T20:32:04.7914279Z self, 2025-05-07T20:32:04.7914471Z T: int, 2025-05-07T20:32:04.7914668Z D: int, 2025-05-07T20:32:04.7915054Z scale_ub: Optional[float], 2025-05-07T20:32:04.7915328Z contiguous: bool, 2025-05-07T20:32:04.7915575Z compiled: bool, 2025-05-07T20:32:04.7915800Z ) -> None: 2025-05-07T20:32:04.7916093Z torch.manual_seed(2025) 2025-05-07T20:32:04.7916425Z 2025-05-07T20:32:04.7916773Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7918826Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7920830Z 2025-05-07T20:32:04.7920961Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.7921177Z 2025-05-07T20:32:04.7921279Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7921687Z self=, 2025-05-07T20:32:04.7922096Z T=4096, 2025-05-07T20:32:04.7922284Z D=5120, 2025-05-07T20:32:04.7922476Z scale_ub=None, 2025-05-07T20:32:04.7922695Z contiguous=True, 2025-05-07T20:32:04.7922912Z compiled=False, 2025-05-07T20:32:04.7923115Z ) 2025-05-07T20:32:04.7923437Z self = 2025-05-07T20:32:04.7923923Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.7924195Z 2025-05-07T20:32:04.7924273Z @given( 2025-05-07T20:32:04.7924501Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7924811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7925108Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7925441Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7925767Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7926043Z ) 2025-05-07T20:32:04.7926399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7926844Z def test_silu_mul_quant( 2025-05-07T20:32:04.7927087Z self, 2025-05-07T20:32:04.7927279Z T: int, 2025-05-07T20:32:04.7927476Z D: int, 2025-05-07T20:32:04.7927694Z scale_ub: Optional[float], 2025-05-07T20:32:04.7927960Z contiguous: bool, 2025-05-07T20:32:04.7928426Z compiled: bool, 2025-05-07T20:32:04.7928657Z ) -> None: 2025-05-07T20:32:04.7928869Z torch.manual_seed(2025) 2025-05-07T20:32:04.7929108Z 2025-05-07T20:32:04.7929376Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7931391Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
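Since Hypothesis draws these parameter sets from st.sampled_from, any failing combination from this log can be pinned so it always re-runs deterministically instead of depending on the search order. A hypothetical, self-contained sketch (test_pinned and the single-parameter strategy are illustrative, not from activation_test.py):

```python
# Sketch: pin a failing parameter set from this log with Hypothesis's
# @example decorator so it is exercised on every run.
from hypothesis import example, given, settings
from hypothesis import strategies as st

@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@example(T=16384)  # the size that hit the 320.00 MiB OOM above
@settings(deadline=None)
def test_pinned(T: int) -> None:
    assert T >= 1
```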
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7933281Z 2025-05-07T20:32:04.7933399Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.7933611Z 2025-05-07T20:32:04.7933714Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7934127Z self=, 2025-05-07T20:32:04.7934531Z T=2048, 2025-05-07T20:32:04.7934716Z D=5120, 2025-05-07T20:32:04.7934908Z scale_ub=None, 2025-05-07T20:32:04.7935131Z contiguous=False, 2025-05-07T20:32:04.7935489Z compiled=False, 2025-05-07T20:32:04.7935694Z ) 2025-05-07T20:32:04.7936009Z self = 2025-05-07T20:32:04.7936489Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:04.7936758Z 2025-05-07T20:32:04.7936836Z @given( 2025-05-07T20:32:04.7937062Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7937368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7937665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7937989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7938310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7938704Z ) 2025-05-07T20:32:04.7939049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7939490Z def test_silu_mul_quant( 2025-05-07T20:32:04.7939731Z self, 2025-05-07T20:32:04.7939932Z T: int, 2025-05-07T20:32:04.7940141Z D: int, 2025-05-07T20:32:04.7940354Z scale_ub: Optional[float], 2025-05-07T20:32:04.7940619Z contiguous: bool, 2025-05-07T20:32:04.7940857Z compiled: bool, 2025-05-07T20:32:04.7941072Z ) -> None: 2025-05-07T20:32:04.7941291Z torch.manual_seed(2025) 2025-05-07T20:32:04.7941534Z 2025-05-07T20:32:04.7941797Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7943872Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7945701Z 2025-05-07T20:32:04.7945819Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.7946031Z 2025-05-07T20:32:04.7946134Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7946540Z self=, 2025-05-07T20:32:04.7946932Z T=4096, 2025-05-07T20:32:04.7947118Z D=7168, 2025-05-07T20:32:04.7947312Z scale_ub=None, 2025-05-07T20:32:04.7947516Z contiguous=True, 2025-05-07T20:32:04.7947736Z compiled=True, 2025-05-07T20:32:04.7947934Z ) 2025-05-07T20:32:04.7948250Z self = 2025-05-07T20:32:04.7948732Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:04.7949005Z 2025-05-07T20:32:04.7949151Z @given( 2025-05-07T20:32:04.7949387Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7949696Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7950014Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7950338Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7950660Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7950945Z ) 2025-05-07T20:32:04.7951290Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7951728Z def test_silu_mul_quant( 2025-05-07T20:32:04.7951966Z self, 2025-05-07T20:32:04.7952175Z T: int, 2025-05-07T20:32:04.7952369Z D: int, 2025-05-07T20:32:04.7952585Z scale_ub: Optional[float], 2025-05-07T20:32:04.7952853Z contiguous: bool, 2025-05-07T20:32:04.7953092Z compiled: bool, 2025-05-07T20:32:04.7953314Z ) -> None: 2025-05-07T20:32:04.7953531Z torch.manual_seed(2025) 2025-05-07T20:32:04.7953777Z 2025-05-07T20:32:04.7954044Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7956173Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7958147Z 2025-05-07T20:32:04.7958271Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.7958536Z 2025-05-07T20:32:04.7958640Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7959100Z self=, 2025-05-07T20:32:04.7959496Z T=2048, 2025-05-07T20:32:04.7959684Z D=5120, 2025-05-07T20:32:04.7959874Z scale_ub=1200.0, 2025-05-07T20:32:04.7960098Z contiguous=False, 2025-05-07T20:32:04.7960340Z compiled=False, 2025-05-07T20:32:04.7960543Z ) 2025-05-07T20:32:04.7960867Z self = 2025-05-07T20:32:04.7961353Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:04.7961629Z 2025-05-07T20:32:04.7961708Z @given( 2025-05-07T20:32:04.7961932Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7962268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7962602Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7962925Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7963253Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7963532Z ) 2025-05-07T20:32:04.7963878Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7964315Z def test_silu_mul_quant( 2025-05-07T20:32:04.7964552Z self, 2025-05-07T20:32:04.7964753Z T: int, 2025-05-07T20:32:04.7964947Z D: int, 2025-05-07T20:32:04.7965164Z scale_ub: Optional[float], 2025-05-07T20:32:04.7965432Z contiguous: bool, 2025-05-07T20:32:04.7965680Z compiled: bool, 2025-05-07T20:32:04.7965902Z ) -> None: 2025-05-07T20:32:04.7966116Z torch.manual_seed(2025) 2025-05-07T20:32:04.7966358Z 2025-05-07T20:32:04.7966640Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7968847Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7970686Z 2025-05-07T20:32:04.7970811Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.7971020Z 2025-05-07T20:32:04.7971120Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7971535Z self=, 2025-05-07T20:32:04.7971929Z T=4096, 2025-05-07T20:32:04.7972124Z D=7168, 2025-05-07T20:32:04.7972339Z scale_ub=1200.0, 2025-05-07T20:32:04.7978552Z contiguous=True, 2025-05-07T20:32:04.7978816Z compiled=False, 2025-05-07T20:32:04.7979019Z ) 2025-05-07T20:32:04.8892900Z self = 2025-05-07T20:32:04.8893740Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:04.8894125Z 2025-05-07T20:32:04.8894236Z @given( 2025-05-07T20:32:04.8894731Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.8895054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.8895354Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.8895684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.8896019Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.8896294Z ) 2025-05-07T20:32:04.8896638Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.8897080Z def test_silu_mul_quant( 2025-05-07T20:32:04.8897316Z self, 2025-05-07T20:32:04.8897509Z T: int, 2025-05-07T20:32:04.8897707Z D: int, 2025-05-07T20:32:04.8897919Z scale_ub: Optional[float], 2025-05-07T20:32:04.8898303Z contiguous: bool, 2025-05-07T20:32:04.8898539Z compiled: bool, 2025-05-07T20:32:04.8898757Z ) -> None: 2025-05-07T20:32:04.8898974Z torch.manual_seed(2025) 2025-05-07T20:32:04.8899212Z 2025-05-07T20:32:04.8899490Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.8901521Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.8903787Z 2025-05-07T20:32:04.8903944Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.8904182Z 2025-05-07T20:32:04.8904286Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.8904696Z self=, 2025-05-07T20:32:04.8905090Z T=16384, 2025-05-07T20:32:04.8905291Z D=7168, 2025-05-07T20:32:04.8905488Z scale_ub=None, 2025-05-07T20:32:04.8905698Z contiguous=False, 2025-05-07T20:32:04.8905926Z compiled=True, 2025-05-07T20:32:04.8906128Z ) 2025-05-07T20:32:04.8906440Z self = 2025-05-07T20:32:04.8906926Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:04.8907208Z 2025-05-07T20:32:04.8907289Z @given( 2025-05-07T20:32:04.8907515Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.8907822Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.8908127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.8908458Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.8908780Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.8909129Z ) 2025-05-07T20:32:04.8909479Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.8909926Z def test_silu_mul_quant( 2025-05-07T20:32:04.8910159Z self, 2025-05-07T20:32:04.8910355Z T: int, 2025-05-07T20:32:04.8910554Z D: int, 2025-05-07T20:32:04.8910769Z scale_ub: Optional[float], 2025-05-07T20:32:04.8911037Z contiguous: bool, 2025-05-07T20:32:04.8911274Z compiled: bool, 2025-05-07T20:32:04.8911491Z ) -> None: 2025-05-07T20:32:04.8911705Z torch.manual_seed(2025) 2025-05-07T20:32:04.8911948Z 2025-05-07T20:32:04.8912214Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.8914321Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.8916167Z 2025-05-07T20:32:04.8916290Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.8916500Z 2025-05-07T20:32:04.8916606Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.8917010Z self=, 2025-05-07T20:32:04.8917403Z T=4096, 2025-05-07T20:32:04.8917591Z D=7168, 2025-05-07T20:32:04.8917786Z scale_ub=None, 2025-05-07T20:32:04.8918002Z contiguous=True, 2025-05-07T20:32:04.8918272Z compiled=False, 2025-05-07T20:32:04.8918512Z ) 2025-05-07T20:32:04.8918826Z self = 2025-05-07T20:32:04.8919317Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.8919583Z 2025-05-07T20:32:04.8919668Z @given( 2025-05-07T20:32:04.8919901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.8920211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.8920519Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.8920847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.8921164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.8921447Z ) 2025-05-07T20:32:04.8921825Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.8922366Z def test_silu_mul_quant( 2025-05-07T20:32:04.8922662Z self, 2025-05-07T20:32:04.8922904Z T: int, 2025-05-07T20:32:04.8923153Z D: int, 2025-05-07T20:32:04.8923437Z scale_ub: Optional[float], 2025-05-07T20:32:04.8923772Z contiguous: bool, 2025-05-07T20:32:04.8924066Z compiled: bool, 2025-05-07T20:32:04.8924295Z ) -> None: 2025-05-07T20:32:04.8924513Z torch.manual_seed(2025) 2025-05-07T20:32:04.8924761Z 2025-05-07T20:32:04.8925028Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.8927049Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.8929146Z 2025-05-07T20:32:04.8929273Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.8929485Z 2025-05-07T20:32:04.8929597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.8930007Z self=, 2025-05-07T20:32:04.8930409Z T=16384, 2025-05-07T20:32:04.8930608Z D=7168, 2025-05-07T20:32:04.8930807Z scale_ub=None, 2025-05-07T20:32:04.8931014Z contiguous=True, 2025-05-07T20:32:04.8931239Z compiled=False, 2025-05-07T20:32:04.8931443Z ) 2025-05-07T20:32:04.8931758Z self = 2025-05-07T20:32:04.8932375Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.8932716Z 2025-05-07T20:32:04.8932819Z @given( 2025-05-07T20:32:04.8933105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.8933496Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.8933872Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.8934194Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.8934517Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.8934800Z ) 2025-05-07T20:32:04.8935293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.8935732Z def test_silu_mul_quant( 2025-05-07T20:32:04.8935972Z self, 2025-05-07T20:32:04.8936170Z T: int, 2025-05-07T20:32:04.8936359Z D: int, 2025-05-07T20:32:04.8936580Z scale_ub: Optional[float], 2025-05-07T20:32:04.8936848Z contiguous: bool, 2025-05-07T20:32:04.8937093Z compiled: bool, 2025-05-07T20:32:04.8937316Z ) -> None: 2025-05-07T20:32:04.8937536Z torch.manual_seed(2025) 2025-05-07T20:32:04.8937774Z 2025-05-07T20:32:04.8938038Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.8940132Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.8942027Z 2025-05-07T20:32:04.8942146Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.8942360Z 2025-05-07T20:32:04.8942463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.8942867Z self=, 2025-05-07T20:32:04.8943260Z T=16384, 2025-05-07T20:32:04.8943454Z D=7168, 2025-05-07T20:32:04.8943645Z scale_ub=1200.0, 2025-05-07T20:32:04.8943866Z contiguous=True, 2025-05-07T20:32:04.8944090Z compiled=False, 2025-05-07T20:32:04.8944295Z ) 2025-05-07T20:32:04.8944611Z self = 2025-05-07T20:32:04.8945101Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:04.8945382Z 2025-05-07T20:32:04.8945464Z @given( 2025-05-07T20:32:04.8945693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.8946003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.8946302Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.8946628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.8946955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.8947234Z ) 2025-05-07T20:32:04.8947576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.8948008Z def test_silu_mul_quant( 2025-05-07T20:32:04.8948250Z self, 2025-05-07T20:32:04.8948446Z T: int, 2025-05-07T20:32:04.8948640Z D: int, 2025-05-07T20:32:04.8948856Z scale_ub: Optional[float], 2025-05-07T20:32:04.8949171Z contiguous: bool, 2025-05-07T20:32:04.8949406Z compiled: bool, 2025-05-07T20:32:04.8949625Z ) -> None: 2025-05-07T20:32:04.8949841Z torch.manual_seed(2025) 2025-05-07T20:32:04.8950078Z 2025-05-07T20:32:04.8950343Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.8952370Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
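Editor's note: the error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying it, under the assumption that it runs in a fresh process (the CUDA caching allocator reads the variable before the first allocation); the shape mirrors the first failing example above:

    import os
    # Must be set before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    # Mirror the first falsifying allocation: T=4096, D=5120, bf16 -> 80 MiB.
    x = torch.randn([4096, 2 * 5120], device="cuda", dtype=torch.bfloat16)
    print(f"{torch.cuda.memory_allocated() / 2**20:.0f} MiB allocated")

Note this only mitigates fragmentation; it cannot help once nearly all of the 22.07 GiB is genuinely in use, as the later examples show.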
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

This example allocated successfully and reached the kernel launch:

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
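Editor's note: the fp8e4nv failure is an architecture gap rather than a logic bug in the test. This job runs on linux.g5.4xlarge.nvidia.gpu (NVIDIA A10G, compute capability 8.6), while Triton's fp8e4nv type requires compute capability 8.9 or newer, which is consistent with the "supported fp8 dtypes are ('fp8e4b15', 'fp8e5')" message. A hedged sketch of a guard such a test could use to skip cleanly on these runners; the class name and skip message are illustrative, not from the log:

    import unittest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv needs SM 8.9+ (Ada/Hopper); the A10G reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class Fp8KernelGuardTests(unittest.TestCase):
        def test_capability(self) -> None:
            # Only runs on hardware where the Triton fp8e4nv path can compile.
            self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))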
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

This compiled example reached the same kernel launch through torch._dynamo
(/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn)
and then through activation.py:80 (silu_mul_quant -> _fbgemm_silu_mul_quant[grid]) into the identical failure:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

The remaining small examples failed with OutOfMemoryError before reaching the kernel, with only 4.44 MiB of the 22.07 GiB capacity still free:

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (moe/activation_test.py:95)

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (moe/activation_test.py:95)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (moe/activation_test.py:92)
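Editor's note: free memory shrinks across examples (26.44 MiB free early in the run, 4.44 MiB by this point), which suggests allocations accumulating across Hypothesis examples. A sketch of an explicit per-example cleanup; whether the harness actually holds stale references is an assumption here, not something the log proves:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop tensors only reachable from dead frames
        torch.cuda.empty_cache()  # return cached allocator blocks to the driver
        torch.cuda.synchronize()  # ensure pending kernels are not pinning memory

    # e.g. call between Hypothesis examples, or from unittest tearDown():
    release_cuda_memory()
    print(f"{torch.cuda.memory_reserved() / 2**20:.1f} MiB still reserved")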
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
+ Exception Group Traceback (most recent call last):
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
  |   yield
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run
  |   self._callTestMethod(testMethod)
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
  |   if method() is not None:
  | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |   T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |   raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. (activation_test.py:92, x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))
    | Falsifying example: test_silu_mul_quant(
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
  +---------------- 2 ----------------
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (activation_test.py:92)
    | Falsifying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
  +---------------- 3 ----------------
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (activation_test.py:92)
    | Falsifying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
  +---------------- 4 ----------------
    | triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( -- ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Raised from activation_test.py:126 (y_fp8_ref, y_scale_ref = ref_fn()) via fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370 (triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]) and the Triton autotuner (triton/runtime/autotuner.py:186 run -> :166 _bench -> triton/testing.py:117 do_bench -> autotuner.py:152 kernel_call -> jit.py:623 run -> compiler.py:273 compile -> compiler.py:100 make_ir).
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
  +------------------------------------
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)

For this example fn() succeeded; the failure came from the reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(then via triton/runtime/autotuner.py and jit.py into compiler.py:273 compile -> make_ir, as in failure 4 above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

>       y_fp8, y_scale = fn()  (moe/activation_test.py:117)

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
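Editor's note: Hypothesis spells out the replay mechanism above. A sketch of how that decorator would sit on this test for local debugging; @reproduce_failure is real Hypothesis API and the blob is the one printed for failure 1, but the body here is a placeholder standing in for the real test_silu_mul_quant body:

    from hypothesis import given, reproduce_failure, settings, strategies as st

    # Temporary, for local debugging only: replays the exact choice sequence
    # behind falsifying example 1. Remove the decorator once the bug is fixed.
    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant_replay(T, D, scale_ub, contiguous, compiled):
        ...  # body of test_silu_mul_quant from activation_test.py goes here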
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.4722315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4723480Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4724403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4725155Z kernel = self.compile( 2025-05-07T20:32:05.4725905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4726822Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4727376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4727709Z 2025-05-07T20:32:05.4727996Z self = 2025-05-07T20:32:05.4729947Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4731865Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f49872f9da0>} 2025-05-07T20:32:05.4733794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.4735195Z context = 2025-05-07T20:32:05.4735590Z 2025-05-07T20:32:05.4735813Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.4736552Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4737179Z module_map=module_map) 2025-05-07T20:32:05.4737692Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4738183Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4738546Z E ^ 2025-05-07T20:32:05.4739174Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4739810Z 2025-05-07T20:32:05.4740386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4741110Z 2025-05-07T20:32:05.4741259Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4741832Z self=, 2025-05-07T20:32:05.4742391Z T=2048, 2025-05-07T20:32:05.4742703Z D=5120, 2025-05-07T20:32:05.4742978Z scale_ub=1200.0, 2025-05-07T20:32:05.4743294Z contiguous=True, 2025-05-07T20:32:05.4743618Z compiled=True, 2025-05-07T20:32:05.4743919Z ) 2025-05-07T20:32:05.4744372Z self = 2025-05-07T20:32:05.4745077Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.4745460Z 2025-05-07T20:32:05.4745580Z @given( 2025-05-07T20:32:05.4745894Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.4746333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.4746756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.4747193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.4747624Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.4748015Z ) 2025-05-07T20:32:05.4748466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.4749033Z def test_silu_mul_quant( 2025-05-07T20:32:05.4749475Z self, 2025-05-07T20:32:05.4749760Z T: int, 2025-05-07T20:32:05.4750035Z D: int, 2025-05-07T20:32:05.4750343Z scale_ub: Optional[float], 2025-05-07T20:32:05.4750717Z contiguous: bool, 2025-05-07T20:32:05.4751044Z compiled: bool, 2025-05-07T20:32:05.4751562Z ) -> None: 2025-05-07T20:32:05.4751870Z torch.manual_seed(2025) 2025-05-07T20:32:05.4752203Z 2025-05-07T20:32:05.4752627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.4753100Z 2025-05-07T20:32:05.4753360Z x_sign = torch.sign(x) 2025-05-07T20:32:05.4753755Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.4754176Z x = x_sign * x_clamp 2025-05-07T20:32:05.4754504Z x0 = x[:, :D] 2025-05-07T20:32:05.4754815Z x1 = x[:, D:] 2025-05-07T20:32:05.4755107Z 2025-05-07T20:32:05.4755361Z if contiguous: 2025-05-07T20:32:05.4755690Z x0 = x0.contiguous() 2025-05-07T20:32:05.4756222Z x1 = x1.contiguous() 2025-05-07T20:32:05.4756568Z 2025-05-07T20:32:05.4756836Z if scale_ub is not None: 2025-05-07T20:32:05.4757224Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.4757705Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.4758128Z ) 2025-05-07T20:32:05.4758386Z else: 2025-05-07T20:32:05.4758682Z scale_ub_tensor = None 2025-05-07T20:32:05.4758936Z 2025-05-07T20:32:05.4759172Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4759487Z op = silu_mul_quant 2025-05-07T20:32:05.4759733Z if compiled: 2025-05-07T20:32:05.4759984Z op = torch.compile(op) 2025-05-07T20:32:05.4760278Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4760547Z 2025-05-07T20:32:05.4760748Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.4761034Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.4761325Z 2025-05-07T20:32:05.4761566Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4761900Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.4762200Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.4762517Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.4762877Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.4763191Z 2025-05-07T20:32:05.4763386Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.4763587Z 2025-05-07T20:32:05.4763688Z moe/activation_test.py:126: 2025-05-07T20:32:05.4763982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4764326Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.4764644Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.4765436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.4766203Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.4766747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4767436Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4768133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.4768858Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.4769620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.4770369Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.4771107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.4771768Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.4772357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.4773578Z fn() 2025-05-07T20:32:05.4774091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.4774666Z self.fn.run( 2025-05-07T20:32:05.4775123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4775656Z kernel = self.compile( 2025-05-07T20:32:05.4776190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4776840Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4777236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4777552Z 2025-05-07T20:32:05.4777755Z self = 2025-05-07T20:32:05.4778848Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4780231Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f4985c122a0>} 2025-05-07T20:32:05.4781579Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.4782659Z context = 2025-05-07T20:32:05.4782955Z 2025-05-07T20:32:05.4783122Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.4783646Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4784107Z module_map=module_map) 2025-05-07T20:32:05.4784474Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4784826Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.4785086Z E ^ 2025-05-07T20:32:05.4785550Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4786005Z 2025-05-07T20:32:05.4786423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4786936Z 2025-05-07T20:32:05.4787043Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4787453Z self=, 2025-05-07T20:32:05.4787862Z T=16384, 2025-05-07T20:32:05.4788057Z D=7168, 2025-05-07T20:32:05.4788248Z scale_ub=1200.0, 2025-05-07T20:32:05.4788470Z contiguous=False, 2025-05-07T20:32:05.4788699Z compiled=False, 2025-05-07T20:32:05.4788897Z ) 2025-05-07T20:32:05.4789331Z self = 2025-05-07T20:32:05.4789835Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.4790112Z 2025-05-07T20:32:05.4790196Z @given( 2025-05-07T20:32:05.4790423Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.4790743Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.4791052Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.4791371Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.4791696Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.4791977Z ) 2025-05-07T20:32:05.4792324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.4792768Z def test_silu_mul_quant( 2025-05-07T20:32:05.4793007Z self, 2025-05-07T20:32:05.4793202Z T: int, 2025-05-07T20:32:05.4793397Z D: int, 2025-05-07T20:32:05.4793707Z scale_ub: Optional[float], 2025-05-07T20:32:05.4793980Z contiguous: bool, 2025-05-07T20:32:05.4794216Z compiled: bool, 2025-05-07T20:32:05.4794437Z ) -> None: 2025-05-07T20:32:05.4794653Z torch.manual_seed(2025) 2025-05-07T20:32:05.4794888Z 2025-05-07T20:32:05.4795162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.4795499Z 2025-05-07T20:32:05.4795688Z x_sign = torch.sign(x) 2025-05-07T20:32:05.4795978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.4796287Z x = x_sign * x_clamp 2025-05-07T20:32:05.4796537Z x0 = x[:, :D] 2025-05-07T20:32:05.4796749Z x1 = x[:, D:] 2025-05-07T20:32:05.4797004Z 2025-05-07T20:32:05.4797265Z if contiguous: 2025-05-07T20:32:05.4797493Z x0 = x0.contiguous() 2025-05-07T20:32:05.4797757Z x1 = x1.contiguous() 2025-05-07T20:32:05.4797999Z 2025-05-07T20:32:05.4798185Z if scale_ub is not None: 2025-05-07T20:32:05.4798471Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.4798807Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.4799115Z ) 2025-05-07T20:32:05.4799306Z else: 2025-05-07T20:32:05.4799521Z scale_ub_tensor = None 2025-05-07T20:32:05.4799772Z 2025-05-07T20:32:05.4799997Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4800312Z op = silu_mul_quant 2025-05-07T20:32:05.4800562Z if compiled: 
2025-05-07T20:32:05.4800808Z op = torch.compile(op) 2025-05-07T20:32:05.4801110Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4801385Z 2025-05-07T20:32:05.4801572Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.4801746Z 2025-05-07T20:32:05.4801844Z moe/activation_test.py:117: 2025-05-07T20:32:05.4802137Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4802459Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.4802745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4803432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.4804122Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.4804646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4805331Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4805990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4806525Z kernel = self.compile( 2025-05-07T20:32:05.4807067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4807719Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4808119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4808351Z 2025-05-07T20:32:05.4808556Z self = 2025-05-07T20:32:05.4809631Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4810993Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f49857a2700>} 2025-05-07T20:32:05.4812358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.4813499Z context = 2025-05-07T20:32:05.4813787Z 2025-05-07T20:32:05.4813952Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.4814470Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4814936Z module_map=module_map) 2025-05-07T20:32:05.4815296Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4815647Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4815907Z E ^ 2025-05-07T20:32:05.4816373Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4816862Z 2025-05-07T20:32:05.4817318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4817838Z 2025-05-07T20:32:05.4817942Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4818358Z self=, 2025-05-07T20:32:05.4818762Z T=1, 2025-05-07T20:32:05.4818947Z D=7168, 2025-05-07T20:32:05.4819147Z scale_ub=None, 2025-05-07T20:32:05.4819368Z contiguous=True, 2025-05-07T20:32:05.4819589Z compiled=True, 2025-05-07T20:32:05.4819797Z ) 2025-05-07T20:32:05.4820118Z self = 2025-05-07T20:32:05.4820592Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.4820856Z 2025-05-07T20:32:05.4820935Z @given( 2025-05-07T20:32:05.4821175Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.4821483Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.4821796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.4822126Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.4822476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.4822788Z ) 2025-05-07T20:32:05.4823146Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.4823592Z def test_silu_mul_quant( 2025-05-07T20:32:05.4823828Z self, 2025-05-07T20:32:05.4824023Z T: int, 2025-05-07T20:32:05.4824222Z D: int, 2025-05-07T20:32:05.4824437Z scale_ub: Optional[float], 2025-05-07T20:32:05.4824711Z contiguous: bool, 2025-05-07T20:32:05.4824949Z compiled: bool, 2025-05-07T20:32:05.4825165Z ) -> None: 2025-05-07T20:32:05.4825384Z torch.manual_seed(2025) 2025-05-07T20:32:05.4825634Z 2025-05-07T20:32:05.4825899Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.4826242Z 2025-05-07T20:32:05.4826443Z x_sign = torch.sign(x) 2025-05-07T20:32:05.4826728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.4827038Z x = x_sign * x_clamp 2025-05-07T20:32:05.4827279Z x0 = x[:, :D] 2025-05-07T20:32:05.4827501Z x1 = x[:, D:] 2025-05-07T20:32:05.4827701Z 2025-05-07T20:32:05.4827889Z if contiguous: 2025-05-07T20:32:05.4828121Z x0 = x0.contiguous() 2025-05-07T20:32:05.4828682Z x1 = x1.contiguous() 2025-05-07T20:32:05.4828920Z 2025-05-07T20:32:05.4829169Z if scale_ub is not None: 2025-05-07T20:32:05.4829439Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.4829773Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.4830084Z ) 2025-05-07T20:32:05.4830272Z else: 2025-05-07T20:32:05.4830480Z scale_ub_tensor = None 2025-05-07T20:32:05.4830737Z 2025-05-07T20:32:05.4830966Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4831289Z op = silu_mul_quant 2025-05-07T20:32:05.4831541Z if compiled: 2025-05-07T20:32:05.4831786Z op = torch.compile(op) 2025-05-07T20:32:05.4832083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4832527Z 2025-05-07T20:32:05.4832721Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.4833012Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.4833304Z 2025-05-07T20:32:05.4833543Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4833874Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.4834166Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.4834479Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.4834833Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.4835155Z 2025-05-07T20:32:05.4835362Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:05.4835712Z 2025-05-07T20:32:05.4835811Z moe/activation_test.py:126: 2025-05-07T20:32:05.4836109Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4836447Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.4836777Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.4837555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.4838302Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.4838850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4839518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4840202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.4840922Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.4841679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.4842420Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.4843144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.4843777Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.4844372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.4844879Z fn() 2025-05-07T20:32:05.4845383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.4845963Z self.fn.run( 2025-05-07T20:32:05.4846427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4846956Z kernel = self.compile( 2025-05-07T20:32:05.4847494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4848149Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4848539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4848773Z 2025-05-07T20:32:05.4848981Z self = 2025-05-07T20:32:05.4850052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4851413Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f4985a18ae0>} 2025-05-07T20:32:05.4852829Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.4853850Z context = 2025-05-07T20:32:05.4854141Z 2025-05-07T20:32:05.4854308Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.4854824Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4855285Z module_map=module_map) 2025-05-07T20:32:05.4855650Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4856004Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.4856274Z E ^ 2025-05-07T20:32:05.4856734Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4857262Z 2025-05-07T20:32:05.4857674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4858181Z 2025-05-07T20:32:05.4858298Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4858712Z self=, 2025-05-07T20:32:05.4859108Z T=4096, 2025-05-07T20:32:05.4859302Z D=5120, 2025-05-07T20:32:05.4859498Z scale_ub=None, 2025-05-07T20:32:05.4859710Z contiguous=False, 2025-05-07T20:32:05.4859945Z compiled=False, 2025-05-07T20:32:05.4860149Z ) 2025-05-07T20:32:05.4860462Z self = 2025-05-07T20:32:05.4860965Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.4861235Z 2025-05-07T20:32:05.4861322Z @given( 2025-05-07T20:32:05.4861554Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.4861870Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.4862173Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.4862505Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.4862830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.4863112Z ) 2025-05-07T20:32:05.4863456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.4863887Z def test_silu_mul_quant( 2025-05-07T20:32:05.4864128Z self, 2025-05-07T20:32:05.4864322Z T: int, 2025-05-07T20:32:05.4864512Z D: int, 2025-05-07T20:32:05.4864734Z scale_ub: Optional[float], 2025-05-07T20:32:05.4865003Z contiguous: bool, 2025-05-07T20:32:05.4865236Z compiled: bool, 2025-05-07T20:32:05.4865459Z ) -> None: 2025-05-07T20:32:05.4865675Z torch.manual_seed(2025) 2025-05-07T20:32:05.4865914Z 2025-05-07T20:32:05.4866194Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.4866536Z 2025-05-07T20:32:05.4866725Z x_sign = torch.sign(x) 2025-05-07T20:32:05.4867014Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.4867323Z x = x_sign * x_clamp 2025-05-07T20:32:05.4867560Z x0 = x[:, :D] 2025-05-07T20:32:05.4867770Z x1 = x[:, D:] 2025-05-07T20:32:05.4867977Z 2025-05-07T20:32:05.4868170Z if contiguous: 2025-05-07T20:32:05.4868398Z x0 = x0.contiguous() 2025-05-07T20:32:05.4868654Z x1 = x1.contiguous() 2025-05-07T20:32:05.4868891Z 2025-05-07T20:32:05.4869123Z if scale_ub is not None: 2025-05-07T20:32:05.4869396Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.4869727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.4878013Z ) 2025-05-07T20:32:05.4878227Z else: 2025-05-07T20:32:05.4878454Z scale_ub_tensor = None 2025-05-07T20:32:05.4878719Z 2025-05-07T20:32:05.4878959Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4879274Z op = silu_mul_quant 2025-05-07T20:32:05.4879529Z if compiled: 
2025-05-07T20:32:05.4879900Z op = torch.compile(op) 2025-05-07T20:32:05.4880195Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4880471Z 2025-05-07T20:32:05.4880670Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.4880834Z 2025-05-07T20:32:05.4880936Z moe/activation_test.py:117: 2025-05-07T20:32:05.4881074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4881177Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.4881278Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4881789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.4881962Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.4882401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4882647Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4882993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4883096Z kernel = self.compile( 2025-05-07T20:32:05.4883481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4883663Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4883793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4883798Z 2025-05-07T20:32:05.4884002Z self = 2025-05-07T20:32:05.4884790Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4885295Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4985c11940>} 2025-05-07T20:32:05.4886047Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.4886239Z context = 2025-05-07T20:32:05.4886243Z 2025-05-07T20:32:05.4886407Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.4886673Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4886786Z module_map=module_map) 2025-05-07T20:32:05.4886957Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4887057Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4887138Z E ^ 2025-05-07T20:32:05.4887508Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4887513Z 2025-05-07T20:32:05.4887930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4887934Z 2025-05-07T20:32:05.4888046Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4888269Z self=, 2025-05-07T20:32:05.4888349Z T=4096, 2025-05-07T20:32:05.4888435Z D=7168, 2025-05-07T20:32:05.4888518Z scale_ub=None, 2025-05-07T20:32:05.4888607Z contiguous=False, 2025-05-07T20:32:05.4888700Z compiled=False, 2025-05-07T20:32:05.4888778Z ) 2025-05-07T20:32:05.4888997Z self = 2025-05-07T20:32:05.4889177Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.4889181Z 2025-05-07T20:32:05.4889345Z @given( 2025-05-07T20:32:05.4889477Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.4889578Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.4889694Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.4889820Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.4889936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.4890012Z ) 2025-05-07T20:32:05.4890263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.4890357Z def test_silu_mul_quant( 2025-05-07T20:32:05.4890436Z self, 2025-05-07T20:32:05.4890524Z T: int, 2025-05-07T20:32:05.4890645Z D: int, 2025-05-07T20:32:05.4890785Z scale_ub: Optional[float], 2025-05-07T20:32:05.4890884Z contiguous: bool, 2025-05-07T20:32:05.4890970Z compiled: bool, 2025-05-07T20:32:05.4891058Z ) -> None: 2025-05-07T20:32:05.4891154Z torch.manual_seed(2025) 2025-05-07T20:32:05.4891235Z 2025-05-07T20:32:05.4891411Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.4891487Z 2025-05-07T20:32:05.4891580Z x_sign = torch.sign(x) 2025-05-07T20:32:05.4891717Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.4891809Z x = x_sign * x_clamp 2025-05-07T20:32:05.4891891Z x0 = x[:, :D] 2025-05-07T20:32:05.4891980Z x1 = x[:, D:] 2025-05-07T20:32:05.4892056Z 2025-05-07T20:32:05.4892142Z if contiguous: 2025-05-07T20:32:05.4892243Z x0 = x0.contiguous() 2025-05-07T20:32:05.4892334Z x1 = x1.contiguous() 2025-05-07T20:32:05.4892420Z 2025-05-07T20:32:05.4892515Z if scale_ub is not None: 2025-05-07T20:32:05.4892622Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.4892765Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.4892842Z ) 2025-05-07T20:32:05.4892926Z else: 2025-05-07T20:32:05.4893029Z scale_ub_tensor = None 2025-05-07T20:32:05.4893105Z 2025-05-07T20:32:05.4893237Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4893337Z op = silu_mul_quant 2025-05-07T20:32:05.4893426Z if compiled: 2025-05-07T20:32:05.4893527Z op = torch.compile(op) 2025-05-07T20:32:05.4893642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4893716Z 2025-05-07T20:32:05.4893816Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.4893820Z 2025-05-07T20:32:05.4893921Z moe/activation_test.py:117: 2025-05-07T20:32:05.4894053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4894172Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.4894273Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4894772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.4894880Z 
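The root cause is hardware, not the test logic: this job runs on a g5.4xlarge, whose NVIDIA A10G reports CUDA compute capability (8, 6), while Triton's fp8e4nv type (FP8 E4M3) needs capability (8, 9) or newer; on SM 8.6 Triton only offers fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of a capability gate the test module could carry -- supports_fp8e4nv and the decorator placement are illustrative assumptions, not helpers that exist in the suite:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv (FP8 E4M3) only on SM 8.9+ GPUs (Ada/Hopper);
    # the A10G on this runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 requires SM 8.9+")
class ActivationTests(unittest.TestCase):  # hypothetical class name
    ...

With a gate like this, the examples below would be reported as skipped instead of erroring one by one.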
Trying example: test_silu_mul_quant(T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError in _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
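The tracebacks show the error is raised while Triton builds the kernel IR (src.make_ir -> ast_to_ttir), i.e. at compile time, before any launch -- which is why even the autotuner's warmup benchmark trips it. The failure can be reproduced outside the suite in a few lines; this is a sketch assuming any CUDA machine with SM < 8.9, and the kernel name is made up:

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_fp8e4nv_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.program_id(axis=0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    # On SM < 8.9 this cast is what make_ir rejects with
    # "type fp8e4nv not supported in this architecture".
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.float32)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# Raises triton.compiler.errors.CompilationError on an A10G (SM 8.6).
_cast_fp8e4nv_kernel[(1,)](x, y, x.numel(), BLOCK=1024)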
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.4938340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4938564Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4938907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4939006Z kernel = self.compile( 2025-05-07T20:32:05.4939393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4939566Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4939694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4939739Z 2025-05-07T20:32:05.4939952Z self = 2025-05-07T20:32:05.4940767Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4941273Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f498548ea20>} 2025-05-07T20:32:05.4942015Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.4942204Z context = 2025-05-07T20:32:05.4942215Z 2025-05-07T20:32:05.4942379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.4942643Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4942759Z module_map=module_map) 2025-05-07T20:32:05.4942920Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4943024Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4943112Z E ^ 2025-05-07T20:32:05.4943464Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4943469Z 2025-05-07T20:32:05.4943887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4943891Z 2025-05-07T20:32:05.4943994Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4944217Z self=, 2025-05-07T20:32:05.4944299Z T=1, 2025-05-07T20:32:05.4944383Z D=5120, 2025-05-07T20:32:05.4944468Z scale_ub=None, 2025-05-07T20:32:05.4944566Z contiguous=True, 2025-05-07T20:32:05.4944650Z compiled=True, 2025-05-07T20:32:05.4944724Z ) 2025-05-07T20:32:05.4944950Z self = 2025-05-07T20:32:05.4945115Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.4945120Z 2025-05-07T20:32:05.4945202Z @given( 2025-05-07T20:32:05.4945321Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.4945419Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.4945541Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.4945658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.4945772Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.4945852Z ) 2025-05-07T20:32:05.4946096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.4946192Z def test_silu_mul_quant( 2025-05-07T20:32:05.4946278Z self, 2025-05-07T20:32:05.4946356Z T: int, 2025-05-07T20:32:05.4946437Z D: int, 2025-05-07T20:32:05.4946534Z scale_ub: Optional[float], 2025-05-07T20:32:05.4946623Z contiguous: bool, 2025-05-07T20:32:05.4946820Z compiled: bool, 2025-05-07T20:32:05.4946901Z ) -> None: 2025-05-07T20:32:05.4946998Z torch.manual_seed(2025) 2025-05-07T20:32:05.4947075Z 2025-05-07T20:32:05.4947243Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.4947316Z 2025-05-07T20:32:05.4947417Z x_sign = torch.sign(x) 2025-05-07T20:32:05.4947542Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.4947631Z x = x_sign * x_clamp 2025-05-07T20:32:05.4947717Z x0 = x[:, :D] 2025-05-07T20:32:05.4947798Z x1 = x[:, D:] 2025-05-07T20:32:05.4947876Z 2025-05-07T20:32:05.4947960Z if contiguous: 2025-05-07T20:32:05.4948097Z x0 = x0.contiguous() 2025-05-07T20:32:05.4948236Z x1 = x1.contiguous() 2025-05-07T20:32:05.4948311Z 2025-05-07T20:32:05.4948401Z if scale_ub is not None: 2025-05-07T20:32:05.4948512Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.4948651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.4948726Z ) 2025-05-07T20:32:05.4948809Z else: 2025-05-07T20:32:05.4948902Z scale_ub_tensor = None 2025-05-07T20:32:05.4948974Z 2025-05-07T20:32:05.4949171Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4949262Z op = silu_mul_quant 2025-05-07T20:32:05.4949346Z if compiled: 2025-05-07T20:32:05.4949452Z op = torch.compile(op) 2025-05-07T20:32:05.4949556Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4949634Z 2025-05-07T20:32:05.4949723Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.4949845Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.4949926Z 2025-05-07T20:32:05.4950061Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4950161Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.4950267Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.4950392Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.4950531Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.4950610Z 2025-05-07T20:32:05.4950708Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:05.4950712Z 2025-05-07T20:32:05.4950815Z moe/activation_test.py:126: 2025-05-07T20:32:05.4950944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4951049Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.4951189Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.4951743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.4951852Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.4952214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4952438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4952808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.4953059Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.4953457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.4953713Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.4954085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.4954264Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.4954600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.4954765Z fn() 2025-05-07T20:32:05.4955171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.4955255Z self.fn.run( 2025-05-07T20:32:05.4955590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4955691Z kernel = self.compile( 2025-05-07T20:32:05.4956069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4956250Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4956417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4956460Z 2025-05-07T20:32:05.4956663Z self = 2025-05-07T20:32:05.4957451Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4957947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
[Hypothesis retries the same test source for each generated example; every attempt below fails with the identical CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The repeated blocks are condensed to their parameters and failure site.]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> fails at `y_fp8_ref, y_scale_ref = ref_fn()` (moe/activation_test.py:126); CompilationError compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> same CompilationError in _kernel_quantize_fp8_row (via ref_fn)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> same CompilationError in _kernel_quantize_fp8_row (via ref_fn)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> same CompilationError in _kernel_quantize_fp8_row (via ref_fn)
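Note: for context on the failing reference path, triton_quantize_fp8_row quantizes each row of y to fp8 with one scale per row, optionally bounded by scale_ub; the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A minimal pure-PyTorch sketch of the assumed semantics follows (the helper name and the exact clamping details are assumptions, not FBGEMM's kernel).

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One dequantization scale per row: y ~= y_fp8.float() * scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # assumed upper bound
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale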
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    -> fails earlier, at `y_fp8, y_scale = fn()` (moe/activation_test.py:117): the torch.compile'd silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, `_fbgemm_silu_mul_quant[grid](`) raises the same CompilationError while compiling _fbgemm_silu_mul_quant
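Note: unlike the preceding examples, this one fails inside fn() itself, with the traceback passing through torch/_dynamo/eval_frame.py into the fused _fbgemm_silu_mul_quant kernel. The eager/compiled toggle in the test is the standard torch.compile pattern; here is a self-contained sketch, with a toy op standing in for silu_mul_quant.

    from typing import Callable

    import torch

    def run_maybe_compiled(op: Callable, *args, compiled: bool):
        if compiled:
            # torch.compile returns a wrapped callable; kernel compilation
            # errors surface later, through _dynamo's eval_frame wrapper.
            op = torch.compile(op)
        return op(*args)

    # e.g.: run_maybe_compiled(lambda a, b: a * torch.sigmoid(a) * b, x0, x1, compiled=True)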
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
    -> same CompilationError in _kernel_quantize_fp8_row (via ref_fn)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
    -> same CompilationError in _fbgemm_silu_mul_quant (via fn, eager path)
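Note: the repetition in this log is expected from the test's Hypothesis settings: verbosity=Verbosity.verbose prints a "Trying example:" block for every generated input, and max_examples=_MAX_SAMPLES bounds how many examples are generated. A minimal sketch of that decorator stack (strategy values trimmed, and _MAX_SAMPLES replaced by a literal for illustration):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=8, deadline=None)
    def test_sketch(T: int, compiled: bool) -> None:
        # With Verbosity.verbose, Hypothesis logs each generated (T, compiled)
        # pair as "Trying example: ..." before running the body.
        assert T in (1, 128)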
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5084356Z 2025-05-07T20:32:05.5084775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5084780Z 2025-05-07T20:32:05.5084986Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5085213Z self=, 2025-05-07T20:32:05.5085298Z T=128, 2025-05-07T20:32:05.5085379Z D=5120, 2025-05-07T20:32:05.5085465Z scale_ub=None, 2025-05-07T20:32:05.5085561Z contiguous=False, 2025-05-07T20:32:05.5085645Z compiled=True, 2025-05-07T20:32:05.5085718Z ) 2025-05-07T20:32:05.5085942Z self = 2025-05-07T20:32:05.5086112Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.5086116Z 2025-05-07T20:32:05.5086200Z @given( 2025-05-07T20:32:05.5086358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5086542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5086668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5086783Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5086900Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5086979Z ) 2025-05-07T20:32:05.5087227Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5087323Z def test_silu_mul_quant( 2025-05-07T20:32:05.5087404Z self, 2025-05-07T20:32:05.5087480Z T: int, 2025-05-07T20:32:05.5087562Z D: int, 2025-05-07T20:32:05.5087660Z scale_ub: Optional[float], 2025-05-07T20:32:05.5087749Z contiguous: bool, 2025-05-07T20:32:05.5087840Z compiled: bool, 2025-05-07T20:32:05.5087919Z ) -> None: 2025-05-07T20:32:05.5088013Z torch.manual_seed(2025) 2025-05-07T20:32:05.5088099Z 2025-05-07T20:32:05.5088268Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5088343Z 2025-05-07T20:32:05.5088439Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5088562Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5088655Z x = x_sign * x_clamp 2025-05-07T20:32:05.5088742Z x0 = x[:, :D] 2025-05-07T20:32:05.5088823Z x1 = x[:, D:] 2025-05-07T20:32:05.5088894Z 2025-05-07T20:32:05.5088984Z if contiguous: 2025-05-07T20:32:05.5089076Z x0 = x0.contiguous() 2025-05-07T20:32:05.5089170Z x1 = x1.contiguous() 2025-05-07T20:32:05.5089241Z 2025-05-07T20:32:05.5089331Z if scale_ub is not None: 2025-05-07T20:32:05.5089444Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5089578Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5089653Z ) 2025-05-07T20:32:05.5089737Z else: 2025-05-07T20:32:05.5089832Z scale_ub_tensor = None 2025-05-07T20:32:05.5089908Z 2025-05-07T20:32:05.5090044Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5090133Z op = silu_mul_quant 2025-05-07T20:32:05.5090219Z if compiled: 2025-05-07T20:32:05.5090331Z op = torch.compile(op) 2025-05-07T20:32:05.5090437Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5090517Z 2025-05-07T20:32:05.5090607Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5090612Z 2025-05-07T20:32:05.5090709Z moe/activation_test.py:117: 2025-05-07T20:32:05.5090848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5090948Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5091051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5091424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5091515Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5092016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5092111Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5092583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5092840Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5093176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5093269Z kernel = self.compile( 2025-05-07T20:32:05.5093653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5093824Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5093960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5094047Z 2025-05-07T20:32:05.5094254Z self = 2025-05-07T20:32:05.5095031Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5095537Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48995e1ee0>} 2025-05-07T20:32:05.5096288Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5096482Z context = 2025-05-07T20:32:05.5096487Z 2025-05-07T20:32:05.5096651Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5096924Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5097034Z module_map=module_map) 2025-05-07T20:32:05.5097198Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5097307Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5097384Z E ^ 2025-05-07T20:32:05.5097739Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.5097744Z 
2025-05-07T20:32:05.5098163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:05.5098168Z 
2025-05-07T20:32:05.5098270Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:05.5098501Z     self=,
2025-05-07T20:32:05.5098581Z     T=128,
2025-05-07T20:32:05.5098661Z     D=7168,
2025-05-07T20:32:05.5098750Z     scale_ub=1200.0,
2025-05-07T20:32:05.5098837Z     contiguous=False,
2025-05-07T20:32:05.5098921Z     compiled=False,
2025-05-07T20:32:05.5099000Z )
2025-05-07T20:32:05.5099220Z self = 
2025-05-07T20:32:05.5099392Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:05.5099396Z 
2025-05-07T20:32:05.5099484Z     @given(
2025-05-07T20:32:05.5099601Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:05.5099707Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:05.5099822Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:05.5099939Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:05.5100059Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:05.5100134Z     )
2025-05-07T20:32:05.5100378Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:05.5100481Z     def test_silu_mul_quant(
2025-05-07T20:32:05.5100559Z         self,
2025-05-07T20:32:05.5100637Z         T: int,
2025-05-07T20:32:05.5100718Z         D: int,
2025-05-07T20:32:05.5100814Z         scale_ub: Optional[float],
2025-05-07T20:32:05.5100997Z         contiguous: bool,
2025-05-07T20:32:05.5101086Z         compiled: bool,
2025-05-07T20:32:05.5101165Z     ) -> None:
2025-05-07T20:32:05.5101267Z         torch.manual_seed(2025)
2025-05-07T20:32:05.5101339Z 
2025-05-07T20:32:05.5101507Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:05.5101587Z 
2025-05-07T20:32:05.5101678Z         x_sign = torch.sign(x)
2025-05-07T20:32:05.5101801Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:05.5101895Z         x = x_sign * x_clamp
2025-05-07T20:32:05.5101975Z         x0 = x[:, :D]
2025-05-07T20:32:05.5102055Z         x1 = x[:, D:]
2025-05-07T20:32:05.5102173Z 
2025-05-07T20:32:05.5102256Z         if contiguous:
2025-05-07T20:32:05.5102388Z             x0 = x0.contiguous()
2025-05-07T20:32:05.5102485Z             x1 = x1.contiguous()
2025-05-07T20:32:05.5102560Z 
2025-05-07T20:32:05.5102655Z         if scale_ub is not None:
2025-05-07T20:32:05.5102767Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:05.5102901Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:05.5102982Z             )
2025-05-07T20:32:05.5103057Z         else:
2025-05-07T20:32:05.5103151Z             scale_ub_tensor = None
2025-05-07T20:32:05.5103228Z 
2025-05-07T20:32:05.5103357Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.5103448Z             op = silu_mul_quant
2025-05-07T20:32:05.5103540Z             if compiled:
2025-05-07T20:32:05.5103639Z                 op = torch.compile(op)
2025-05-07T20:32:05.5103745Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.5103827Z 
2025-05-07T20:32:05.5103922Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:05.5103929Z 
2025-05-07T20:32:05.5104040Z moe/activation_test.py:117: 
2025-05-07T20:32:05.5104168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.5104268Z moe/activation_test.py:115: in fn
2025-05-07T20:32:05.5104378Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.5104875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:05.5104970Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:05.5105330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:05.5105554Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:05.5105901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:05.5105996Z     kernel = self.compile(
2025-05-07T20:32:05.5106378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:05.5106556Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:05.5106687Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.5106691Z 
2025-05-07T20:32:05.5106901Z self = 
2025-05-07T20:32:05.5107671Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:05.5108166Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48995e1a80>}
2025-05-07T20:32:05.5108917Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:05.5109239Z context = 
2025-05-07T20:32:05.5109334Z 
2025-05-07T20:32:05.5109505Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:05.5109766Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:05.5109874Z                           module_map=module_map)
2025-05-07T20:32:05.5110039Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:05.5110136Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:05.5110221Z E       ^
2025-05-07T20:32:05.5110573Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.5110578Z 
2025-05-07T20:32:05.5111992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
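Every failure in this section bottoms out in the same Triton check: fp8e4nv is Triton's name for the float8_e4m3fn encoding, and this Triton build's CUDA backend only emits it for sufficiently new GPUs, leaving fp8e4b15 and fp8e5 on older parts, exactly as the ValueError reports. A guard along the following lines would let the suite skip cleanly instead of erroring on such GPUs. This is a sketch: the helper name, the test-class name, and the (8, 9) capability threshold (Ada/Hopper, where fp8e4nv support is generally understood to begin) are assumptions, not FBGEMM or Triton API.

    import unittest

    import torch

    def has_fp8e4nv_support() -> bool:
        # Hypothetical helper: assume Triton's fp8e4nv (float8_e4m3fn)
        # codegen needs compute capability >= (8, 9); the ValueError above
        # is what an older architecture produces instead.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not has_fp8e4nv_support(), "fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):  # class name assumed
        ...  # test_silu_mul_quant and friends would go here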
Hypothesis then drew further examples; each failed with the same CompilationError (ValueError: type fp8e4nv not supported in this architecture) raised while compiling _fbgemm_silu_mul_quant. For compiled=True draws the traceback additionally passes through torch/_dynamo/eval_frame.py:678 (torch.compile's wrapper) before reaching activation.py:80; the failure itself is identical:

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
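For orientation while reading these repeated tracebacks: the semantics under test are spelled out by the test's own ref_fn, shown in the next example below, namely SiLU(x0) * x1 computed in fp32 and then quantized row-wise to fp8. The unquantized half, restated as a self-contained eager sketch of ref_fn's first three lines:

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in float32, exactly as ref_fn in test_silu_mul_quant
        # computes it before handing off to triton_quantize_fp8_row.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32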
The next draw failed the same way:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

The draw after that is the one example in this section that behaves differently: fn() returned, and the failure moved one step later, into the reference path, which compiles a Triton kernel of its own:

2025-05-07T20:32:05.5170506Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:05.5170728Z     self=,
2025-05-07T20:32:05.5170816Z     T=1,
2025-05-07T20:32:05.5170894Z     D=7168,
2025-05-07T20:32:05.5170977Z     scale_ub=None,
2025-05-07T20:32:05.5171077Z     contiguous=False,
2025-05-07T20:32:05.5171162Z     compiled=True,
2025-05-07T20:32:05.5171235Z )
2025-05-07T20:32:05.5171467Z self = 
2025-05-07T20:32:05.5171630Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:05.5171635Z 
    ...
2025-05-07T20:32:05.5175641Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.5175732Z             op = silu_mul_quant
2025-05-07T20:32:05.5175818Z             if compiled:
2025-05-07T20:32:05.5175926Z                 op = torch.compile(op)
2025-05-07T20:32:05.5176033Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.5176109Z 
2025-05-07T20:32:05.5176215Z         y_fp8, y_scale = fn()
2025-05-07T20:32:05.5176337Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:05.5176418Z 
2025-05-07T20:32:05.5176553Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.5176661Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:05.5176769Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:05.5176894Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:05.5177034Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:05.5177117Z 
2025-05-07T20:32:05.5177218Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:05.5177222Z 
2025-05-07T20:32:05.5177320Z moe/activation_test.py:126: 
2025-05-07T20:32:05.5177461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.5177570Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:05.5177716Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:05.5178275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:05.5178377Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:05.5178749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:05.5178974Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:05.5179347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:05.5179602Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:05.5179996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:05.5180255Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:05.5180630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:05.5180796Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:05.5181252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:05.5181332Z     fn()
2025-05-07T20:32:05.5181736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:05.5181820Z     self.fn.run(
2025-05-07T20:32:05.5182156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:05.5182256Z     kernel = self.compile(
2025-05-07T20:32:05.5182674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:05.5182897Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:05.5183070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.5183075Z 
2025-05-07T20:32:05.5183279Z self = 
2025-05-07T20:32:05.5184059Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:05.5184555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899e4dda0>}
2025-05-07T20:32:05.5185302Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:05.5185497Z context = 
2025-05-07T20:32:05.5185502Z 
2025-05-07T20:32:05.5185668Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:05.5185932Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:05.5186041Z                           module_map=module_map)
2025-05-07T20:32:05.5186208Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:05.5186312Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:05.5186391Z E       ^
2025-05-07T20:32:05.5186750Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.5186755Z 
2025-05-07T20:32:05.5187167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
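So the reference path is not an eager fallback: triton_quantize_fp8_row autotunes a Triton kernel (_kernel_quantize_fp8_row) that needs the same fp8e4nv conversion, and it dies identically. A rough eager stand-in is possible with PyTorch's native float8 dtype (torch.float8_e4m3fn, available in recent PyTorch). The sketch below assumes the row-wise scheme is "scale each row so its absmax, optionally capped by scale_ub, maps to the fp8 e4m3 maximum of 448"; that is consistent with how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]) but it is an assumption, not the kernel's documented contract.

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Assumed row-wise recipe: one scale per row, chosen so the row's
        # absmax lands on FP8_E4M3_MAX; scale_ub, when given, caps the absmax.
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y.to(torch.float32) / scale).to(torch.float8_e4m3fn)
        # Dequantize as the test does: y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale.squeeze(-1)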
at 0x7f4899e4dda0>} 2025-05-07T20:32:05.5185302Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5185497Z context = 2025-05-07T20:32:05.5185502Z 2025-05-07T20:32:05.5185668Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5185932Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5186041Z module_map=module_map) 2025-05-07T20:32:05.5186208Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5186312Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.5186391Z E ^ 2025-05-07T20:32:05.5186750Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5186755Z 2025-05-07T20:32:05.5187167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5187174Z 2025-05-07T20:32:05.5187284Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5187507Z self=, 2025-05-07T20:32:05.5187586Z T=1, 2025-05-07T20:32:05.5187671Z D=5120, 2025-05-07T20:32:05.5187754Z scale_ub=1200.0, 2025-05-07T20:32:05.5187845Z contiguous=False, 2025-05-07T20:32:05.5187936Z compiled=True, 2025-05-07T20:32:05.5188010Z ) 2025-05-07T20:32:05.5188234Z self = 2025-05-07T20:32:05.5188399Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.5188403Z 2025-05-07T20:32:05.5188481Z @given( 2025-05-07T20:32:05.5188604Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5188702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5188816Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5188936Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5189050Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5189185Z ) 2025-05-07T20:32:05.5189429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5189523Z def test_silu_mul_quant( 2025-05-07T20:32:05.5189689Z self, 2025-05-07T20:32:05.5189767Z T: int, 2025-05-07T20:32:05.5189843Z D: int, 2025-05-07T20:32:05.5189946Z scale_ub: Optional[float], 2025-05-07T20:32:05.5190034Z contiguous: bool, 2025-05-07T20:32:05.5190119Z compiled: bool, 2025-05-07T20:32:05.5190202Z ) -> None: 2025-05-07T20:32:05.5190296Z torch.manual_seed(2025) 2025-05-07T20:32:05.5190369Z 2025-05-07T20:32:05.5190540Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5190613Z 2025-05-07T20:32:05.5190703Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5190834Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5190966Z x = x_sign * x_clamp 2025-05-07T20:32:05.5191092Z x0 = x[:, :D] 2025-05-07T20:32:05.5191173Z x1 = x[:, D:] 2025-05-07T20:32:05.5191244Z 2025-05-07T20:32:05.5191333Z if contiguous: 2025-05-07T20:32:05.5191424Z x0 = x0.contiguous() 2025-05-07T20:32:05.5191518Z x1 = x1.contiguous() 2025-05-07T20:32:05.5191596Z 2025-05-07T20:32:05.5191687Z if scale_ub is not None: 2025-05-07T20:32:05.5191791Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5191934Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5192009Z ) 2025-05-07T20:32:05.5192085Z else: 2025-05-07T20:32:05.5192182Z scale_ub_tensor = None 2025-05-07T20:32:05.5192254Z 2025-05-07T20:32:05.5192389Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5192491Z op = silu_mul_quant 2025-05-07T20:32:05.5192588Z if compiled: 
2025-05-07T20:32:05.5192714Z op = torch.compile(op) 2025-05-07T20:32:05.5192827Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5192899Z 2025-05-07T20:32:05.5192997Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5193001Z 2025-05-07T20:32:05.5193098Z moe/activation_test.py:117: 2025-05-07T20:32:05.5193235Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5193343Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5193443Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5193814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5193906Z return fn(*args, **kwargs) 2025-05-07T20:32:05.5194396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5194498Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5194852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5195077Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5195421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5195514Z kernel = self.compile( 2025-05-07T20:32:05.5195896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5196066Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5196193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5196197Z 2025-05-07T20:32:05.5196406Z self = 2025-05-07T20:32:05.5197173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5197765Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899e4e020>} 2025-05-07T20:32:05.5198511Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5198698Z context = 2025-05-07T20:32:05.5198713Z 2025-05-07T20:32:05.5198876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5199135Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5199247Z module_map=module_map) 2025-05-07T20:32:05.5199448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5199585Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5199669Z E ^ 2025-05-07T20:32:05.5200030Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5200035Z 2025-05-07T20:32:05.5200457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5200461Z 2025-05-07T20:32:05.5200566Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5200788Z self=, 2025-05-07T20:32:05.5200874Z T=1, 2025-05-07T20:32:05.5200952Z D=5120, 2025-05-07T20:32:05.5201035Z scale_ub=1200.0, 2025-05-07T20:32:05.5201129Z contiguous=False, 2025-05-07T20:32:05.5201214Z compiled=False, 2025-05-07T20:32:05.5201289Z ) 2025-05-07T20:32:05.5201509Z self = 2025-05-07T20:32:05.5201681Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.5201686Z 2025-05-07T20:32:05.5201772Z @given( 2025-05-07T20:32:05.5201892Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5201995Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5202115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5202231Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5202342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5202426Z ) 2025-05-07T20:32:05.5202667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5202768Z def test_silu_mul_quant( 2025-05-07T20:32:05.5202844Z self, 2025-05-07T20:32:05.5202920Z T: int, 2025-05-07T20:32:05.5203003Z D: int, 2025-05-07T20:32:05.5203101Z scale_ub: Optional[float], 2025-05-07T20:32:05.5203192Z contiguous: bool, 2025-05-07T20:32:05.5203286Z compiled: bool, 2025-05-07T20:32:05.5203363Z ) -> None: 2025-05-07T20:32:05.5203458Z torch.manual_seed(2025) 2025-05-07T20:32:05.5203536Z 2025-05-07T20:32:05.5203706Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5203781Z 2025-05-07T20:32:05.5203879Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5204005Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5204096Z x = x_sign * x_clamp 2025-05-07T20:32:05.5204180Z x0 = x[:, :D] 2025-05-07T20:32:05.5204260Z x1 = x[:, D:] 2025-05-07T20:32:05.5204338Z 2025-05-07T20:32:05.5204421Z if contiguous: 2025-05-07T20:32:05.5204516Z x0 = x0.contiguous() 2025-05-07T20:32:05.5204610Z x1 = x1.contiguous() 2025-05-07T20:32:05.5204680Z 2025-05-07T20:32:05.5204782Z if scale_ub is not None: 2025-05-07T20:32:05.5204890Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5205028Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5205109Z ) 2025-05-07T20:32:05.5205184Z else: 2025-05-07T20:32:05.5205278Z scale_ub_tensor = None 2025-05-07T20:32:05.5205358Z 2025-05-07T20:32:05.5205570Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5205662Z op = silu_mul_quant 2025-05-07T20:32:05.5205754Z if compiled: 2025-05-07T20:32:05.5205854Z op = torch.compile(op) 2025-05-07T20:32:05.5205968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5206037Z 2025-05-07T20:32:05.5206127Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5206132Z 2025-05-07T20:32:05.5206236Z moe/activation_test.py:117: 2025-05-07T20:32:05.5206364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5206465Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5206614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5207146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5207248Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5207607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5207824Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5208168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5208262Z kernel = self.compile( 2025-05-07T20:32:05.5208642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5208821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5208951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5208958Z 2025-05-07T20:32:05.5209163Z self = 2025-05-07T20:32:05.5209941Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5210436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898fbc720>} 2025-05-07T20:32:05.5211186Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5211374Z context = 2025-05-07T20:32:05.5211381Z 2025-05-07T20:32:05.5211552Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5211809Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5211920Z module_map=module_map) 2025-05-07T20:32:05.5212088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5212187Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5212270Z E ^ 2025-05-07T20:32:05.5212620Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5212625Z 2025-05-07T20:32:05.5213034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5213038Z 2025-05-07T20:32:05.5213145Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5213366Z self=, 2025-05-07T20:32:05.5213455Z T=16384, 2025-05-07T20:32:05.5213530Z D=5120, 2025-05-07T20:32:05.5213613Z scale_ub=1200.0, 2025-05-07T20:32:05.5213704Z contiguous=False, 2025-05-07T20:32:05.5213787Z compiled=True, 2025-05-07T20:32:05.5213859Z ) 2025-05-07T20:32:05.5214189Z self = 2025-05-07T20:32:05.5214368Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.5214373Z 2025-05-07T20:32:05.5214450Z @given( 2025-05-07T20:32:05.5214575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5214674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5214795Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5214911Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5215023Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5215101Z ) 2025-05-07T20:32:05.5215384Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5215516Z def test_silu_mul_quant( 2025-05-07T20:32:05.5215599Z self, 2025-05-07T20:32:05.5215676Z T: int, 2025-05-07T20:32:05.5215753Z D: int, 2025-05-07T20:32:05.5215863Z scale_ub: Optional[float], 2025-05-07T20:32:05.5215953Z contiguous: bool, 2025-05-07T20:32:05.5216038Z compiled: bool, 2025-05-07T20:32:05.5216120Z ) -> None: 2025-05-07T20:32:05.5216216Z torch.manual_seed(2025) 2025-05-07T20:32:05.5216293Z 2025-05-07T20:32:05.5216461Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5216536Z 2025-05-07T20:32:05.5216634Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5216757Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5216847Z x = x_sign * x_clamp 2025-05-07T20:32:05.5216935Z x0 = x[:, :D] 2025-05-07T20:32:05.5217017Z x1 = x[:, D:] 2025-05-07T20:32:05.5217092Z 2025-05-07T20:32:05.5217181Z if contiguous: 2025-05-07T20:32:05.5217273Z x0 = x0.contiguous() 2025-05-07T20:32:05.5217364Z x1 = x1.contiguous() 2025-05-07T20:32:05.5217441Z 2025-05-07T20:32:05.5217535Z if scale_ub is not None: 2025-05-07T20:32:05.5217646Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5217786Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5217860Z ) 2025-05-07T20:32:05.5217944Z else: 2025-05-07T20:32:05.5218037Z scale_ub_tensor = None 2025-05-07T20:32:05.5218109Z 2025-05-07T20:32:05.5218242Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5218331Z op = silu_mul_quant 2025-05-07T20:32:05.5218416Z if compiled: 2025-05-07T20:32:05.5218522Z op = torch.compile(op) 2025-05-07T20:32:05.5218626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5218704Z 2025-05-07T20:32:05.5218805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5218809Z 2025-05-07T20:32:05.5218906Z moe/activation_test.py:117: 2025-05-07T20:32:05.5219042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5219147Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5219247Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5219619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5219715Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5220204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5220308Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5220664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5220891Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5221230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5221323Z kernel = self.compile( 2025-05-07T20:32:05.5221798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5221972Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5222099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5222110Z 2025-05-07T20:32:05.5222313Z self = 2025-05-07T20:32:05.5223082Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5223624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898fbdd00>} 2025-05-07T20:32:05.5224409Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5224606Z context = 2025-05-07T20:32:05.5224611Z 2025-05-07T20:32:05.5224772Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5225032Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5225146Z module_map=module_map) 2025-05-07T20:32:05.5225307Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5225417Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5225500Z E ^ 2025-05-07T20:32:05.5225855Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5225860Z 2025-05-07T20:32:05.5226281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5226286Z 2025-05-07T20:32:05.5226387Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5226609Z self=, 2025-05-07T20:32:05.5226696Z T=2048, 2025-05-07T20:32:05.5226774Z D=7168, 2025-05-07T20:32:05.5226865Z scale_ub=1200.0, 2025-05-07T20:32:05.5226952Z contiguous=False, 2025-05-07T20:32:05.5227035Z compiled=True, 2025-05-07T20:32:05.5227112Z ) 2025-05-07T20:32:05.5227328Z self = 2025-05-07T20:32:05.5227500Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.5227509Z 2025-05-07T20:32:05.5227592Z @given( 2025-05-07T20:32:05.5227711Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5227809Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5227930Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5228047Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5228432Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5228546Z ) 2025-05-07T20:32:05.5228879Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5228979Z def test_silu_mul_quant( 2025-05-07T20:32:05.5229116Z self, 2025-05-07T20:32:05.5229198Z T: int, 2025-05-07T20:32:05.5229279Z D: int, 2025-05-07T20:32:05.5229375Z scale_ub: Optional[float], 2025-05-07T20:32:05.5229463Z contiguous: bool, 2025-05-07T20:32:05.5229554Z compiled: bool, 2025-05-07T20:32:05.5229635Z ) -> None: 2025-05-07T20:32:05.5229732Z torch.manual_seed(2025) 2025-05-07T20:32:05.5229810Z 2025-05-07T20:32:05.5229976Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5230055Z 2025-05-07T20:32:05.5230148Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5230514Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5230613Z x = x_sign * x_clamp 2025-05-07T20:32:05.5230693Z x0 = x[:, :D] 2025-05-07T20:32:05.5230772Z x1 = x[:, D:] 2025-05-07T20:32:05.5230850Z 2025-05-07T20:32:05.5230933Z if contiguous: 2025-05-07T20:32:05.5231024Z x0 = x0.contiguous() 2025-05-07T20:32:05.5231121Z x1 = x1.contiguous() 2025-05-07T20:32:05.5231191Z 2025-05-07T20:32:05.5231281Z if scale_ub is not None: 2025-05-07T20:32:05.5231392Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5231526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5231731Z ) 2025-05-07T20:32:05.5231807Z else: 2025-05-07T20:32:05.5231900Z scale_ub_tensor = None 2025-05-07T20:32:05.5231981Z 2025-05-07T20:32:05.5232111Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5232206Z op = silu_mul_quant 2025-05-07T20:32:05.5232300Z if compiled: 2025-05-07T20:32:05.5232401Z op = torch.compile(op) 2025-05-07T20:32:05.5232506Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5232583Z 2025-05-07T20:32:05.5232673Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5232678Z 2025-05-07T20:32:05.5232775Z moe/activation_test.py:117: 2025-05-07T20:32:05.5232909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5233009Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5233114Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5233480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5233578Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898fbe840>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48991640e0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
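Note on the failure mode: Triton's fp8e4nv is the e4m3 format that torch.float8_e4m3fn lowers to, and Triton only compiles fp8e4nv casts on NVIDIA GPUs with compute capability 8.9 (Ada) or 9.0 (Hopper) and newer; on older architectures it exposes only fp8e4b15 and fp8e5, which is exactly what the ValueError above reports. The failure is therefore environmental, not input-dependent, so every Hypothesis draw hits the same CompilationError regardless of T, D, scale_ub, contiguous, or compiled. A capability guard along these lines could skip the test on unsupported hardware (a sketch only; device_supports_fp8e4nv is a made-up helper, not part of activation_test.py):

    import unittest
    import torch

    def device_supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) casts compile in Triton only on GPUs
        # with compute capability >= 8.9 (Ada, Hopper, and newer).
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Applied to the test above, e.g.:
    # @unittest.skipUnless(device_supports_fp8e4nv(), "GPU lacks fp8e4nv support")

With such a guard the run would report one skip instead of a wall of identical CompilationErrors.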
Hypothesis went on to try eleven more examples; every draw failed at the same _fbgemm_silu_mul_quant compile with the identical CompilationError, only the sampled parameters differing:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) -- same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -- same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -- same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -- failing as before:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5407707Z 2025-05-07T20:32:05.5408125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5408132Z 2025-05-07T20:32:05.5408246Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5408473Z self=, 2025-05-07T20:32:05.5408553Z T=16384, 2025-05-07T20:32:05.5408639Z D=5120, 2025-05-07T20:32:05.5408725Z scale_ub=1200.0, 2025-05-07T20:32:05.5408822Z contiguous=False, 2025-05-07T20:32:05.5408909Z compiled=False, 2025-05-07T20:32:05.5408984Z ) 2025-05-07T20:32:05.5409209Z self = 2025-05-07T20:32:05.5409390Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.5409394Z 2025-05-07T20:32:05.5409473Z @given( 2025-05-07T20:32:05.5409601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5409700Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5409819Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5409944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5410058Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5410145Z ) 2025-05-07T20:32:05.5410395Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5410489Z def test_silu_mul_quant( 2025-05-07T20:32:05.5410573Z self, 2025-05-07T20:32:05.5410651Z T: int, 2025-05-07T20:32:05.5410728Z D: int, 2025-05-07T20:32:05.5410835Z scale_ub: Optional[float], 2025-05-07T20:32:05.5410925Z contiguous: bool, 2025-05-07T20:32:05.5411012Z compiled: bool, 2025-05-07T20:32:05.5411098Z ) -> None: 2025-05-07T20:32:05.5411194Z torch.manual_seed(2025) 2025-05-07T20:32:05.5411267Z 2025-05-07T20:32:05.5411443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5411520Z 2025-05-07T20:32:05.5411622Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5411747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5411836Z x = x_sign * x_clamp 2025-05-07T20:32:05.5411925Z x0 = x[:, :D] 2025-05-07T20:32:05.5412007Z x1 = x[:, D:] 2025-05-07T20:32:05.5412170Z 2025-05-07T20:32:05.5412263Z if contiguous: 2025-05-07T20:32:05.5412357Z x0 = x0.contiguous() 2025-05-07T20:32:05.5412449Z x1 = x1.contiguous() 2025-05-07T20:32:05.5412534Z 2025-05-07T20:32:05.5412627Z if scale_ub is not None: 2025-05-07T20:32:05.5412735Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5412882Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5412958Z ) 2025-05-07T20:32:05.5413036Z else: 2025-05-07T20:32:05.5413139Z scale_ub_tensor = None 2025-05-07T20:32:05.5413216Z 2025-05-07T20:32:05.5413359Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5413527Z op = silu_mul_quant 2025-05-07T20:32:05.5413615Z if compiled: 2025-05-07T20:32:05.5413725Z op = torch.compile(op) 2025-05-07T20:32:05.5413831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5413913Z 2025-05-07T20:32:05.5414013Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5414017Z 2025-05-07T20:32:05.5414116Z moe/activation_test.py:117: 2025-05-07T20:32:05.5414248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5414365Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5414465Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5414973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:05.5415072Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5415429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5415665Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5416004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5416103Z kernel = self.compile( 2025-05-07T20:32:05.5416493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5416666Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5416808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5416812Z 2025-05-07T20:32:05.5417015Z self = 2025-05-07T20:32:05.5417787Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5418297Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898b24c20>} 2025-05-07T20:32:05.5419045Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5419244Z context = 2025-05-07T20:32:05.5419249Z 2025-05-07T20:32:05.5419416Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5419689Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5419798Z module_map=module_map) 2025-05-07T20:32:05.5419962Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5420074Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5420154Z E ^ 2025-05-07T20:32:05.5420508Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5420596Z 2025-05-07T20:32:05.5421020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5421024Z 2025-05-07T20:32:05.5421129Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5421361Z self=, 2025-05-07T20:32:05.5421440Z T=16384, 2025-05-07T20:32:05.5421518Z D=5120, 2025-05-07T20:32:05.5421610Z scale_ub=1200.0, 2025-05-07T20:32:05.5421701Z contiguous=True, 2025-05-07T20:32:05.5421785Z compiled=True, 2025-05-07T20:32:05.5421866Z ) 2025-05-07T20:32:05.5422085Z self = 2025-05-07T20:32:05.5422337Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.5422351Z 2025-05-07T20:32:05.5422429Z @given( 2025-05-07T20:32:05.5422548Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5422665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5422780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5422898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5423018Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5423094Z ) 2025-05-07T20:32:05.5423338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5423443Z def test_silu_mul_quant( 2025-05-07T20:32:05.5423522Z self, 2025-05-07T20:32:05.5423600Z T: int, 2025-05-07T20:32:05.5423687Z D: int, 2025-05-07T20:32:05.5423789Z scale_ub: Optional[float], 2025-05-07T20:32:05.5423891Z contiguous: bool, 2025-05-07T20:32:05.5423985Z compiled: bool, 2025-05-07T20:32:05.5424065Z ) -> None: 2025-05-07T20:32:05.5424166Z torch.manual_seed(2025) 2025-05-07T20:32:05.5424239Z 2025-05-07T20:32:05.5424414Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5424495Z 2025-05-07T20:32:05.5424590Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5424714Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5424811Z x = x_sign * x_clamp 2025-05-07T20:32:05.5424892Z x0 = x[:, :D] 2025-05-07T20:32:05.5424973Z x1 = x[:, D:] 2025-05-07T20:32:05.5425055Z 2025-05-07T20:32:05.5425141Z if contiguous: 2025-05-07T20:32:05.5425241Z x0 = x0.contiguous() 2025-05-07T20:32:05.5425330Z x1 = x1.contiguous() 2025-05-07T20:32:05.5425403Z 2025-05-07T20:32:05.5425502Z if scale_ub is not None: 2025-05-07T20:32:05.5425611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5425750Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5425834Z ) 2025-05-07T20:32:05.5425912Z else: 2025-05-07T20:32:05.5426008Z scale_ub_tensor = None 2025-05-07T20:32:05.5426092Z 2025-05-07T20:32:05.5426226Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5426318Z op = silu_mul_quant 2025-05-07T20:32:05.5426412Z if compiled: 2025-05-07T20:32:05.5426512Z op = torch.compile(op) 2025-05-07T20:32:05.5426627Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5426702Z 2025-05-07T20:32:05.5426794Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5426798Z 2025-05-07T20:32:05.5426908Z moe/activation_test.py:117: 2025-05-07T20:32:05.5427040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5427143Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5427256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5427627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5427722Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5428882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5429009Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5429428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5429652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5429991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5430096Z kernel = self.compile( 2025-05-07T20:32:05.5430478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5430803Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5430933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5430938Z 2025-05-07T20:32:05.5431150Z self = 2025-05-07T20:32:05.5431927Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5432424Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898b260c0>} 2025-05-07T20:32:05.5433171Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5433366Z context = 2025-05-07T20:32:05.5433371Z 2025-05-07T20:32:05.5433534Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5433807Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5433919Z module_map=module_map) 2025-05-07T20:32:05.5434088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5434189Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5434270Z E ^ 2025-05-07T20:32:05.5434632Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5434637Z 2025-05-07T20:32:05.5435053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5435063Z 2025-05-07T20:32:05.5435174Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5435397Z self=, 2025-05-07T20:32:05.5435476Z T=16384, 2025-05-07T20:32:05.5435566Z D=5120, 2025-05-07T20:32:05.5435654Z scale_ub=None, 2025-05-07T20:32:05.5435743Z contiguous=False, 2025-05-07T20:32:05.5435840Z compiled=True, 2025-05-07T20:32:05.5435914Z ) 2025-05-07T20:32:05.5436130Z self = 2025-05-07T20:32:05.5436316Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.5436321Z 2025-05-07T20:32:05.5436399Z @given( 2025-05-07T20:32:05.5436526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5436629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5436744Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5436873Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5436993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5437067Z ) 2025-05-07T20:32:05.5437321Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5437547Z def test_silu_mul_quant( 2025-05-07T20:32:05.5437631Z self, 2025-05-07T20:32:05.5437709Z T: int, 2025-05-07T20:32:05.5437785Z D: int, 2025-05-07T20:32:05.5437889Z scale_ub: Optional[float], 2025-05-07T20:32:05.5437980Z contiguous: bool, 2025-05-07T20:32:05.5438065Z compiled: bool, 2025-05-07T20:32:05.5438150Z ) -> None: 2025-05-07T20:32:05.5438245Z torch.manual_seed(2025) 2025-05-07T20:32:05.5438318Z 2025-05-07T20:32:05.5438492Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5438566Z 2025-05-07T20:32:05.5438665Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5438788Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5438956Z x = x_sign * x_clamp 2025-05-07T20:32:05.5439043Z x0 = x[:, :D] 2025-05-07T20:32:05.5439123Z x1 = x[:, D:] 2025-05-07T20:32:05.5439195Z 2025-05-07T20:32:05.5439283Z if contiguous: 2025-05-07T20:32:05.5439379Z x0 = x0.contiguous() 2025-05-07T20:32:05.5439470Z x1 = x1.contiguous() 2025-05-07T20:32:05.5439549Z 2025-05-07T20:32:05.5439641Z if scale_ub is not None: 2025-05-07T20:32:05.5439747Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5439888Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5439964Z ) 2025-05-07T20:32:05.5440048Z else: 2025-05-07T20:32:05.5440140Z scale_ub_tensor = None 2025-05-07T20:32:05.5440212Z 2025-05-07T20:32:05.5440345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5440434Z op = silu_mul_quant 2025-05-07T20:32:05.5440522Z if compiled: 2025-05-07T20:32:05.5440634Z op = torch.compile(op) 2025-05-07T20:32:05.5440738Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5440810Z 2025-05-07T20:32:05.5440907Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5440911Z 2025-05-07T20:32:05.5441014Z moe/activation_test.py:117: 2025-05-07T20:32:05.5441141Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5441247Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5441347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5441718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5441811Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5442299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5442400Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5442756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5442985Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5443324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5443418Z kernel = self.compile( 2025-05-07T20:32:05.5443801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5443976Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5444104Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5444108Z 2025-05-07T20:32:05.5444317Z self = 2025-05-07T20:32:05.5445087Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5445689Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898b26c00>} 2025-05-07T20:32:05.5446432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5446625Z context = 2025-05-07T20:32:05.5446630Z 2025-05-07T20:32:05.5446791Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5447049Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5447201Z module_map=module_map) 2025-05-07T20:32:05.5447400Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5447497Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5447579Z E ^ 2025-05-07T20:32:05.5447935Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5447940Z 2025-05-07T20:32:05.5448356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5448361Z 2025-05-07T20:32:05.5448463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5448682Z self=, 2025-05-07T20:32:05.5448769Z T=2048, 2025-05-07T20:32:05.5448846Z D=5120, 2025-05-07T20:32:05.5448930Z scale_ub=None, 2025-05-07T20:32:05.5449026Z contiguous=False, 2025-05-07T20:32:05.5449108Z compiled=True, 2025-05-07T20:32:05.5449190Z ) 2025-05-07T20:32:05.5449406Z self = 2025-05-07T20:32:05.5449579Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.5449584Z 2025-05-07T20:32:05.5449667Z @given( 2025-05-07T20:32:05.5449787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5449886Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5450008Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5450124Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5450241Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5450314Z ) 2025-05-07T20:32:05.5450557Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5450656Z def test_silu_mul_quant( 2025-05-07T20:32:05.5450733Z self, 2025-05-07T20:32:05.5450810Z T: int, 2025-05-07T20:32:05.5450892Z D: int, 2025-05-07T20:32:05.5450993Z scale_ub: Optional[float], 2025-05-07T20:32:05.5451085Z contiguous: bool, 2025-05-07T20:32:05.5451179Z compiled: bool, 2025-05-07T20:32:05.5451258Z ) -> None: 2025-05-07T20:32:05.5451353Z torch.manual_seed(2025) 2025-05-07T20:32:05.5451432Z 2025-05-07T20:32:05.5451604Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5451677Z 2025-05-07T20:32:05.5451775Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5451898Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5451995Z x = x_sign * x_clamp 2025-05-07T20:32:05.5452077Z x0 = x[:, :D] 2025-05-07T20:32:05.5452160Z x1 = x[:, D:] 2025-05-07T20:32:05.5452240Z 2025-05-07T20:32:05.5452324Z if contiguous: 2025-05-07T20:32:05.5452420Z x0 = x0.contiguous() 2025-05-07T20:32:05.5452515Z x1 = x1.contiguous() 2025-05-07T20:32:05.5452587Z 2025-05-07T20:32:05.5452680Z if scale_ub is not None: 2025-05-07T20:32:05.5452798Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5452933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5453009Z ) 2025-05-07T20:32:05.5453091Z else: 2025-05-07T20:32:05.5453343Z scale_ub_tensor = None 2025-05-07T20:32:05.5453427Z 2025-05-07T20:32:05.5453555Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5453647Z op = silu_mul_quant 2025-05-07T20:32:05.5453740Z if compiled: 2025-05-07T20:32:05.5453840Z op = torch.compile(op) 2025-05-07T20:32:05.5453946Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5454024Z 2025-05-07T20:32:05.5454113Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5454118Z 2025-05-07T20:32:05.5454213Z moe/activation_test.py:117: 2025-05-07T20:32:05.5454345Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5454489Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5454633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5455003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5455094Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5455594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5455691Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5456045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5456271Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5456606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5456708Z kernel = self.compile( 2025-05-07T20:32:05.5457089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5457262Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5457396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5457405Z 2025-05-07T20:32:05.5457605Z self = 2025-05-07T20:32:05.5458395Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5458892Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489881c680>} 2025-05-07T20:32:05.5459641Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5459833Z context = 2025-05-07T20:32:05.5459837Z 2025-05-07T20:32:05.5460013Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5460271Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5460378Z module_map=module_map) 2025-05-07T20:32:05.5460545Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5460644Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5460721Z E ^ 2025-05-07T20:32:05.5461079Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5461084Z 2025-05-07T20:32:05.5461498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5461507Z 2025-05-07T20:32:05.5461615Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5461838Z self=, 2025-05-07T20:32:05.5461916Z T=2048, 2025-05-07T20:32:05.5462082Z D=5120, 2025-05-07T20:32:05.5462167Z scale_ub=1200.0, 2025-05-07T20:32:05.5462254Z contiguous=False, 2025-05-07T20:32:05.5462342Z compiled=True, 2025-05-07T20:32:05.5462415Z ) 2025-05-07T20:32:05.5462630Z self = 2025-05-07T20:32:05.5462809Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.5462814Z 2025-05-07T20:32:05.5462891Z @given( 2025-05-07T20:32:05.5463015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5463114Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5463228Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5463432Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5463547Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5463621Z ) 2025-05-07T20:32:05.5463876Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5463970Z def test_silu_mul_quant( 2025-05-07T20:32:05.5464055Z self, 2025-05-07T20:32:05.5464132Z T: int, 2025-05-07T20:32:05.5464207Z D: int, 2025-05-07T20:32:05.5464312Z scale_ub: Optional[float], 2025-05-07T20:32:05.5464401Z contiguous: bool, 2025-05-07T20:32:05.5464487Z compiled: bool, 2025-05-07T20:32:05.5464572Z ) -> None: 2025-05-07T20:32:05.5464665Z torch.manual_seed(2025) 2025-05-07T20:32:05.5464736Z 2025-05-07T20:32:05.5464908Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5464980Z 2025-05-07T20:32:05.5465077Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5465209Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5465295Z x = x_sign * x_clamp 2025-05-07T20:32:05.5465375Z x0 = x[:, :D] 2025-05-07T20:32:05.5465461Z x1 = x[:, D:] 2025-05-07T20:32:05.5465533Z 2025-05-07T20:32:05.5465627Z if contiguous: 2025-05-07T20:32:05.5465717Z x0 = x0.contiguous() 2025-05-07T20:32:05.5465805Z x1 = x1.contiguous() 2025-05-07T20:32:05.5465881Z 2025-05-07T20:32:05.5465972Z if scale_ub is not None: 2025-05-07T20:32:05.5466076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5466218Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5466293Z ) 2025-05-07T20:32:05.5466367Z else: 2025-05-07T20:32:05.5466466Z scale_ub_tensor = None 2025-05-07T20:32:05.5466537Z 2025-05-07T20:32:05.5466664Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5466763Z op = silu_mul_quant 2025-05-07T20:32:05.5466851Z if compiled: 2025-05-07T20:32:05.5466957Z op = torch.compile(op) 2025-05-07T20:32:05.5467062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5467133Z 2025-05-07T20:32:05.5467233Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5467238Z 2025-05-07T20:32:05.5467336Z moe/activation_test.py:117: 2025-05-07T20:32:05.5467464Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5467570Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5467670Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5468035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5468132Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5468623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5468728Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5469175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5469398Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5469866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5469960Z kernel = self.compile( 2025-05-07T20:32:05.5470343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5470513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5470641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5470645Z 2025-05-07T20:32:05.5470854Z self = 2025-05-07T20:32:05.5471664Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5472220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489881d1c0>} 2025-05-07T20:32:05.5472961Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5473149Z context = 2025-05-07T20:32:05.5473154Z 2025-05-07T20:32:05.5473327Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5473585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5473702Z module_map=module_map) 2025-05-07T20:32:05.5473863Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5473960Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5474044Z E ^ 2025-05-07T20:32:05.5474400Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5474405Z 2025-05-07T20:32:05.5474816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5474827Z 2025-05-07T20:32:05.5474930Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5475152Z self=, 2025-05-07T20:32:05.5475236Z T=4096, 2025-05-07T20:32:05.5475312Z D=5120, 2025-05-07T20:32:05.5475397Z scale_ub=1200.0, 2025-05-07T20:32:05.5475489Z contiguous=True, 2025-05-07T20:32:05.5475574Z compiled=True, 2025-05-07T20:32:05.5475650Z ) 2025-05-07T20:32:05.5475872Z self = 2025-05-07T20:32:05.5476041Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.5476045Z 2025-05-07T20:32:05.5476135Z @given( 2025-05-07T20:32:05.5476253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5476351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5476472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5476588Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5476702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5476782Z ) 2025-05-07T20:32:05.5477025Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5477118Z def test_silu_mul_quant( 2025-05-07T20:32:05.5477202Z self, 2025-05-07T20:32:05.5477282Z T: int, 2025-05-07T20:32:05.5477361Z D: int, 2025-05-07T20:32:05.5477466Z scale_ub: Optional[float], 2025-05-07T20:32:05.5477555Z contiguous: bool, 2025-05-07T20:32:05.5477648Z compiled: bool, 2025-05-07T20:32:05.5477729Z ) -> None: 2025-05-07T20:32:05.5477910Z torch.manual_seed(2025) 2025-05-07T20:32:05.5477991Z 2025-05-07T20:32:05.5478159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5478232Z 2025-05-07T20:32:05.5478329Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5478454Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5478543Z x = x_sign * x_clamp 2025-05-07T20:32:05.5478632Z x0 = x[:, :D] 2025-05-07T20:32:05.5478712Z x1 = x[:, D:] 2025-05-07T20:32:05.5478783Z 2025-05-07T20:32:05.5478873Z if contiguous: 2025-05-07T20:32:05.5478963Z x0 = x0.contiguous() 2025-05-07T20:32:05.5479055Z x1 = x1.contiguous() 2025-05-07T20:32:05.5479171Z 2025-05-07T20:32:05.5479305Z if scale_ub is not None: 2025-05-07T20:32:05.5479416Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5479552Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5479628Z ) 2025-05-07T20:32:05.5479716Z else: 2025-05-07T20:32:05.5479809Z scale_ub_tensor = None 2025-05-07T20:32:05.5479880Z 2025-05-07T20:32:05.5480014Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5480104Z op = silu_mul_quant 2025-05-07T20:32:05.5480188Z if compiled: 2025-05-07T20:32:05.5480293Z op = torch.compile(op) 2025-05-07T20:32:05.5480398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5480476Z 2025-05-07T20:32:05.5480565Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5480570Z 2025-05-07T20:32:05.5480666Z moe/activation_test.py:117: 2025-05-07T20:32:05.5480802Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5480908Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5481007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5481378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5481474Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5481973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5482069Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5482422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5482647Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5482981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5483077Z kernel = self.compile( 2025-05-07T20:32:05.5483464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5483636Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5483774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5483779Z 2025-05-07T20:32:05.5483980Z self = 2025-05-07T20:32:05.5484752Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5485252Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489881da80>} 2025-05-07T20:32:05.5485994Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5486191Z context = 2025-05-07T20:32:05.5486282Z 2025-05-07T20:32:05.5486449Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5486708Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5486824Z module_map=module_map) 2025-05-07T20:32:05.5486982Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5487090Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5487167Z E ^ 2025-05-07T20:32:05.5487519Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5487562Z 2025-05-07T20:32:05.5487985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5488028Z 2025-05-07T20:32:05.5488132Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5488368Z self=, 2025-05-07T20:32:05.5488446Z T=128, 2025-05-07T20:32:05.5488522Z D=5120, 2025-05-07T20:32:05.5488610Z scale_ub=1200.0, 2025-05-07T20:32:05.5488697Z contiguous=False, 2025-05-07T20:32:05.5488783Z compiled=True, 2025-05-07T20:32:05.5488860Z ) 2025-05-07T20:32:05.5489076Z self = 2025-05-07T20:32:05.5489244Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.5489248Z 2025-05-07T20:32:05.5489333Z @given( 2025-05-07T20:32:05.5489451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5489558Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5489677Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5489793Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5489912Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5489985Z ) 2025-05-07T20:32:05.5490233Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5490336Z def test_silu_mul_quant( 2025-05-07T20:32:05.5490412Z self, 2025-05-07T20:32:05.5490487Z T: int, 2025-05-07T20:32:05.5490570Z D: int, 2025-05-07T20:32:05.5490667Z scale_ub: Optional[float], 2025-05-07T20:32:05.5490755Z contiguous: bool, 2025-05-07T20:32:05.5490847Z compiled: bool, 2025-05-07T20:32:05.5490924Z ) -> None: 2025-05-07T20:32:05.5491025Z torch.manual_seed(2025) 2025-05-07T20:32:05.5491098Z 2025-05-07T20:32:05.5491266Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5491347Z 2025-05-07T20:32:05.5491441Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5491565Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5491661Z x = x_sign * x_clamp 2025-05-07T20:32:05.5491741Z x0 = x[:, :D] 2025-05-07T20:32:05.5491828Z x1 = x[:, D:] 2025-05-07T20:32:05.5491907Z 2025-05-07T20:32:05.5491992Z if contiguous: 2025-05-07T20:32:05.5492083Z x0 = x0.contiguous() 2025-05-07T20:32:05.5492178Z x1 = x1.contiguous() 2025-05-07T20:32:05.5492249Z 2025-05-07T20:32:05.5492349Z if scale_ub is not None: 2025-05-07T20:32:05.5492455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5492589Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5492671Z ) 2025-05-07T20:32:05.5492748Z else: 2025-05-07T20:32:05.5492842Z scale_ub_tensor = None 2025-05-07T20:32:05.5492919Z 2025-05-07T20:32:05.5493048Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5493140Z op = silu_mul_quant 2025-05-07T20:32:05.5493229Z if compiled: 2025-05-07T20:32:05.5493327Z op = torch.compile(op) 2025-05-07T20:32:05.5493431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5493603Z 2025-05-07T20:32:05.5493696Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5493700Z 2025-05-07T20:32:05.5493803Z moe/activation_test.py:117: 2025-05-07T20:32:05.5493931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5494031Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5494136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5494500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5494592Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5495086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5495297Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5499969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5500248Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5500592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5500691Z kernel = self.compile( 2025-05-07T20:32:05.5501080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5501253Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5501387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5501392Z 2025-05-07T20:32:05.5501595Z self = 2025-05-07T20:32:05.5502400Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5502910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489881fa60>} 2025-05-07T20:32:05.5503651Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5503849Z context = 2025-05-07T20:32:05.5503854Z 2025-05-07T20:32:05.5504017Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5504286Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5504396Z module_map=module_map) 2025-05-07T20:32:05.5504559Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5504666Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5504746Z E ^ 2025-05-07T20:32:05.5505103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5505108Z 2025-05-07T20:32:05.5505527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5505532Z 2025-05-07T20:32:05.5505635Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5505863Z self=, 2025-05-07T20:32:05.5505943Z T=16384, 2025-05-07T20:32:05.5506020Z D=7168, 2025-05-07T20:32:05.5506114Z scale_ub=1200.0, 2025-05-07T20:32:05.5506203Z contiguous=True, 2025-05-07T20:32:05.5506287Z compiled=True, 2025-05-07T20:32:05.5506366Z ) 2025-05-07T20:32:05.5506583Z self = 2025-05-07T20:32:05.5506835Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.5506847Z 2025-05-07T20:32:05.5506926Z @given( 2025-05-07T20:32:05.5507043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5507151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5507266Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5507382Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5507503Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5507578Z ) 2025-05-07T20:32:05.5507822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5507926Z def test_silu_mul_quant( 2025-05-07T20:32:05.5508066Z self, 2025-05-07T20:32:05.5508185Z T: int, 2025-05-07T20:32:05.5508269Z D: int, 2025-05-07T20:32:05.5508367Z scale_ub: Optional[float], 2025-05-07T20:32:05.5508463Z contiguous: bool, 2025-05-07T20:32:05.5508648Z compiled: bool, 2025-05-07T20:32:05.5508734Z ) -> None: 2025-05-07T20:32:05.5508836Z torch.manual_seed(2025) 2025-05-07T20:32:05.5508911Z 2025-05-07T20:32:05.5509169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5509251Z 2025-05-07T20:32:05.5509344Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5509469Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5509566Z x = x_sign * x_clamp 2025-05-07T20:32:05.5509648Z x0 = x[:, :D] 2025-05-07T20:32:05.5509729Z x1 = x[:, D:] 2025-05-07T20:32:05.5509810Z 2025-05-07T20:32:05.5509896Z if contiguous: 2025-05-07T20:32:05.5509989Z x0 = x0.contiguous() 2025-05-07T20:32:05.5510093Z x1 = x1.contiguous() 2025-05-07T20:32:05.5510169Z 2025-05-07T20:32:05.5510266Z if scale_ub is not None: 2025-05-07T20:32:05.5510373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5510514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5510596Z ) 2025-05-07T20:32:05.5510672Z else: 2025-05-07T20:32:05.5510768Z scale_ub_tensor = None 2025-05-07T20:32:05.5510849Z 2025-05-07T20:32:05.5510976Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5511067Z op = silu_mul_quant 2025-05-07T20:32:05.5511159Z if compiled: 2025-05-07T20:32:05.5511260Z op = torch.compile(op) 2025-05-07T20:32:05.5511364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5511444Z 2025-05-07T20:32:05.5511535Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5511539Z 2025-05-07T20:32:05.5511644Z moe/activation_test.py:117: 2025-05-07T20:32:05.5511779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5511880Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5511986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5512362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5512475Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5512994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5513091Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5513450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5513671Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5514006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5514112Z kernel = self.compile( 2025-05-07T20:32:05.5514490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5514711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5514848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5514852Z 2025-05-07T20:32:05.5515054Z self = 2025-05-07T20:32:05.5515828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5516324Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489875cd60>} 2025-05-07T20:32:05.5517213Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5517406Z context = 2025-05-07T20:32:05.5517411Z 2025-05-07T20:32:05.5517574Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5517838Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5517945Z module_map=module_map) 2025-05-07T20:32:05.5518113Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5518211Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5518289Z E ^ 2025-05-07T20:32:05.5518648Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5518658Z 2025-05-07T20:32:05.5519068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5519072Z 2025-05-07T20:32:05.5519184Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5519407Z self=, 2025-05-07T20:32:05.5519486Z T=16384, 2025-05-07T20:32:05.5519569Z D=5120, 2025-05-07T20:32:05.5519652Z scale_ub=1200.0, 2025-05-07T20:32:05.5519740Z contiguous=True, 2025-05-07T20:32:05.5519834Z compiled=False, 2025-05-07T20:32:05.5519908Z ) 2025-05-07T20:32:05.5520125Z self = 2025-05-07T20:32:05.5520307Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.5520312Z 2025-05-07T20:32:05.5520389Z @given( 2025-05-07T20:32:05.5520506Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5520617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5520731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5520853Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5520970Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5521044Z ) 2025-05-07T20:32:05.5521292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5521387Z def test_silu_mul_quant( 2025-05-07T20:32:05.5521464Z self, 2025-05-07T20:32:05.5521549Z T: int, 2025-05-07T20:32:05.5521626Z D: int, 2025-05-07T20:32:05.5521724Z scale_ub: Optional[float], 2025-05-07T20:32:05.5521819Z contiguous: bool, 2025-05-07T20:32:05.5521906Z compiled: bool, 2025-05-07T20:32:05.5521991Z ) -> None: 2025-05-07T20:32:05.5522086Z torch.manual_seed(2025) 2025-05-07T20:32:05.5522159Z 2025-05-07T20:32:05.5522336Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5522412Z 2025-05-07T20:32:05.5527278Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5527432Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5527530Z x = x_sign * x_clamp 2025-05-07T20:32:05.5527702Z x0 = x[:, :D] 2025-05-07T20:32:05.5527785Z x1 = x[:, D:] 2025-05-07T20:32:05.5527869Z 2025-05-07T20:32:05.5527955Z if contiguous: 2025-05-07T20:32:05.5528049Z x0 = x0.contiguous() 2025-05-07T20:32:05.5528444Z x1 = x1.contiguous() 2025-05-07T20:32:05.5528557Z 2025-05-07T20:32:05.5528692Z if scale_ub is not None: 2025-05-07T20:32:05.5528849Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5529037Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5529124Z ) 2025-05-07T20:32:05.5529213Z else: 2025-05-07T20:32:05.5529312Z scale_ub_tensor = None 2025-05-07T20:32:05.5529520Z 2025-05-07T20:32:05.5529734Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5529827Z op = silu_mul_quant 2025-05-07T20:32:05.5529923Z if compiled: 2025-05-07T20:32:05.5530093Z op = torch.compile(op) 2025-05-07T20:32:05.5530207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5530291Z 2025-05-07T20:32:05.5530385Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5530390Z 2025-05-07T20:32:05.5530492Z moe/activation_test.py:117: 2025-05-07T20:32:05.5530634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5530737Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5530839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5531350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
(Per-line timestamps are omitted below; <...> marks an object repr that the log capture stripped.)

    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(Triton JIT traceback identical to the one above; fails with the same CompilationError at compiler.py:100.)
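Every CompilationError in this run has the same root cause: Triton cannot lower the fp8e4nv (e4m3) type on the GPU this runner exposes, which only supports fp8e4b15 and fp8e5. A minimal sketch of a capability gate one could add to the test module follows; the (8, 9) threshold is an assumption (fp8e4nv codegen is generally tied to Ada/Hopper-class parts), not something this log confirms, and the class name is hypothetical:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (e4m3) codegen needs an NVIDIA GPU
        # with compute capability >= 8.9; older architectures expose only
        # fp8e4b15 / fp8e5, which matches the ValueError in this log.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical gate for the failing test class:
    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires sm_89+")
    class ActivationTestsFP8(unittest.TestCase):
        pass

Skipping on pre-sm_89 runners would keep the job green here while still exercising the fp8 kernels on hardware that actually supports fp8e4nv.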
The next three examples fail with the same fp8e4nv CompilationError; their repeated test source and Triton tracebacks are collapsed here:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> fails in fn() via torch/_dynamo/eval_frame.py:678 and activation.py:80: in silu_mul_quant
    -> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
    -> same CompilationError

The first out-of-memory failure follows:

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
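These out-of-memory failures are cumulative: each Hypothesis example allocates several (T, 2*D) bfloat16 tensors (at T=16384, D=7168 a single one is 16384 x 14336 x 2 bytes = 448 MiB), and the allocator's cached blocks from earlier examples are never returned, so the 22.07 GiB card fills up over the run. The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; a hypothetical sketch of that plus an explicit cache flush between examples (helper name is an assumption, not FBGEMM code):

    import gc
    import os

    import torch

    # Assumption: effective only if set before the first CUDA allocation in
    # the process; in CI it is safer to export this in the job environment.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


    def free_cached_blocks() -> None:
        """Return the CUDA caching allocator's blocks to the driver.

        Hypothesis reruns a @given test body many times inside a single
        setUp/tearDown pair, so calling this at the top of the test body
        (rather than in tearDown) frees one example's bf16 activations
        before the next example allocates its own.
        """
        gc.collect()
        torch.cuda.empty_cache()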
The remaining examples alternate between the same two failure modes; they are condensed below to their unique parameters and allocation figures:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    -> moe/activation_test.py:95 (x_clamp): torch.OutOfMemoryError: tried to allocate 112.00 MiB (28.44 MiB free; 21.61 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> moe/activation_test.py:92 (torch.randn): torch.OutOfMemoryError: tried to allocate 448.00 MiB (140.44 MiB free; 21.50 GiB allocated, 141.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    -> moe/activation_test.py:95 (x_clamp): torch.OutOfMemoryError: tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated, 85.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
    -> moe/activation_test.py:94 (x_sign): torch.OutOfMemoryError: tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated, 85.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
    -> activation.py:80 (_fbgemm_silu_mul_quant[grid]): same fp8e4nv CompilationError as above

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
    -> same CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
    -> same CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
    -> moe/activation_test.py:92 (torch.randn): torch.OutOfMemoryError: tried to allocate 56.00 MiB (26.44 MiB free; 21.69 GiB allocated, 59.18 MiB reserved but unallocated)
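When triaging a run like this, a specific failing draw can be made to replay deterministically with Hypothesis's @example decorator. A minimal sketch, with the strategies copied from the test above; the function name is hypothetical and the body (elided) would be the one shown in this log:

    from typing import Optional

    import hypothesis.strategies as st
    from hypothesis import example, given, settings


    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    # Pin the first failing draw from this log so it replays on every run:
    @example(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    @settings(deadline=None, max_examples=10)
    def test_silu_mul_quant_repro(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # test body as in the log above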
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> activation.py:80 (_fbgemm_silu_mul_quant[grid]): same fp8e4nv CompilationError as above

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
    -> moe/activation_test.py:94 (x_sign): torch.OutOfMemoryError: tried to allocate 40.00 MiB (26.44 MiB free; 21.73 GiB allocated, 19.12 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
    -> moe/activation_test.py:92 (torch.randn): torch.OutOfMemoryError: tried to allocate 320.00 MiB (26.44 MiB free; 21.73 GiB allocated, 19.12 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False

        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5722417Z 2025-05-07T20:32:05.5722538Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5722542Z 2025-05-07T20:32:05.5722649Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5722877Z self=, 2025-05-07T20:32:05.5722957Z T=2048, 2025-05-07T20:32:05.5723036Z D=5120, 2025-05-07T20:32:05.5723128Z scale_ub=None, 2025-05-07T20:32:05.5723219Z contiguous=False, 2025-05-07T20:32:05.5723305Z compiled=False, 2025-05-07T20:32:05.5723391Z ) 2025-05-07T20:32:05.5723609Z self = 2025-05-07T20:32:05.5723788Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.5723797Z 2025-05-07T20:32:05.5723878Z @given( 2025-05-07T20:32:05.5723995Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5724103Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5724262Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5724380Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5724499Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5724574Z ) 2025-05-07T20:32:05.5724823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5724919Z def test_silu_mul_quant( 2025-05-07T20:32:05.5724998Z self, 2025-05-07T20:32:05.5725084Z T: int, 2025-05-07T20:32:05.5725163Z D: int, 2025-05-07T20:32:05.5725262Z scale_ub: Optional[float], 2025-05-07T20:32:05.5725359Z contiguous: bool, 2025-05-07T20:32:05.5725448Z compiled: bool, 2025-05-07T20:32:05.5725581Z ) -> None: 2025-05-07T20:32:05.5725719Z torch.manual_seed(2025) 2025-05-07T20:32:05.5725792Z 2025-05-07T20:32:05.5725962Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5727764Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5727770Z 2025-05-07T20:32:05.5727888Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5727902Z 2025-05-07T20:32:05.5728004Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5728458Z self=, 2025-05-07T20:32:05.5728585Z T=4096, 2025-05-07T20:32:05.5728680Z D=7168, 2025-05-07T20:32:05.5728766Z scale_ub=None, 2025-05-07T20:32:05.5728864Z contiguous=True, 2025-05-07T20:32:05.5728952Z compiled=True, 2025-05-07T20:32:05.5729030Z ) 2025-05-07T20:32:05.5729256Z self = 2025-05-07T20:32:05.5729424Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.5729429Z 2025-05-07T20:32:05.5729508Z @given( 2025-05-07T20:32:05.5729633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5729731Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5729852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5729969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5730084Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5730176Z ) 2025-05-07T20:32:05.5730418Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5730516Z def test_silu_mul_quant( 2025-05-07T20:32:05.5730602Z self, 2025-05-07T20:32:05.5730682Z T: int, 2025-05-07T20:32:05.5730760Z D: int, 2025-05-07T20:32:05.5730864Z scale_ub: Optional[float], 2025-05-07T20:32:05.5730953Z contiguous: bool, 2025-05-07T20:32:05.5731049Z compiled: bool, 2025-05-07T20:32:05.5731130Z ) -> None: 2025-05-07T20:32:05.5731226Z torch.manual_seed(2025) 2025-05-07T20:32:05.5731310Z 2025-05-07T20:32:05.5731476Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5733404Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5733423Z 2025-05-07T20:32:05.5733544Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5733548Z 2025-05-07T20:32:05.5733649Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5733879Z self=, 2025-05-07T20:32:05.5733957Z T=2048, 2025-05-07T20:32:05.5734035Z D=5120, 2025-05-07T20:32:05.5734125Z scale_ub=1200.0, 2025-05-07T20:32:05.5734214Z contiguous=False, 2025-05-07T20:32:05.5734299Z compiled=False, 2025-05-07T20:32:05.5734380Z ) 2025-05-07T20:32:05.5734597Z self = 2025-05-07T20:32:05.5734899Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.5734903Z 2025-05-07T20:32:05.5734982Z @given( 2025-05-07T20:32:05.5735152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5735260Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5735377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5735492Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5735610Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5735684Z ) 2025-05-07T20:32:05.5735934Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5736030Z def test_silu_mul_quant( 2025-05-07T20:32:05.5736107Z self, 2025-05-07T20:32:05.5736190Z T: int, 2025-05-07T20:32:05.5736268Z D: int, 2025-05-07T20:32:05.5736368Z scale_ub: Optional[float], 2025-05-07T20:32:05.5736467Z contiguous: bool, 2025-05-07T20:32:05.5736557Z compiled: bool, 2025-05-07T20:32:05.5736633Z ) -> None: 2025-05-07T20:32:05.5736731Z torch.manual_seed(2025) 2025-05-07T20:32:05.5736807Z 2025-05-07T20:32:05.5736974Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5738730Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5738736Z 2025-05-07T20:32:05.5738852Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5738866Z 2025-05-07T20:32:05.5738969Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5739188Z self=, 2025-05-07T20:32:05.5739269Z T=4096, 2025-05-07T20:32:05.5739352Z D=7168, 2025-05-07T20:32:05.5739440Z scale_ub=1200.0, 2025-05-07T20:32:05.5739534Z contiguous=True, 2025-05-07T20:32:05.5739619Z compiled=False, 2025-05-07T20:32:05.5739693Z ) 2025-05-07T20:32:05.5739913Z self = 2025-05-07T20:32:05.5740082Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.5740087Z 2025-05-07T20:32:05.5740165Z @given( 2025-05-07T20:32:05.5740287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5740386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5740502Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5740620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5740734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5740813Z ) 2025-05-07T20:32:05.5741056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5741202Z def test_silu_mul_quant( 2025-05-07T20:32:05.5741285Z self, 2025-05-07T20:32:05.5741362Z T: int, 2025-05-07T20:32:05.5741438Z D: int, 2025-05-07T20:32:05.5741541Z scale_ub: Optional[float], 2025-05-07T20:32:05.5741632Z contiguous: bool, 2025-05-07T20:32:05.5741716Z compiled: bool, 2025-05-07T20:32:05.5741799Z ) -> None: 2025-05-07T20:32:05.5741891Z torch.manual_seed(2025) 2025-05-07T20:32:05.5741968Z 2025-05-07T20:32:05.5742132Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5743953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5744040Z 2025-05-07T20:32:05.5744156Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5744161Z 2025-05-07T20:32:05.5744262Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5744487Z self=, 2025-05-07T20:32:05.5744564Z T=16384, 2025-05-07T20:32:05.5744639Z D=7168, 2025-05-07T20:32:05.5744725Z scale_ub=None, 2025-05-07T20:32:05.5744811Z contiguous=False, 2025-05-07T20:32:05.5744893Z compiled=True, 2025-05-07T20:32:05.5744973Z ) 2025-05-07T20:32:05.5745186Z self = 2025-05-07T20:32:05.5745371Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.5745376Z 2025-05-07T20:32:05.5745452Z @given( 2025-05-07T20:32:05.5745571Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5745671Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5745782Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5745896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5746015Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5746089Z ) 2025-05-07T20:32:05.5746330Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5746430Z def test_silu_mul_quant( 2025-05-07T20:32:05.5746507Z self, 2025-05-07T20:32:05.5746591Z T: int, 2025-05-07T20:32:05.5746670Z D: int, 2025-05-07T20:32:05.5746772Z scale_ub: Optional[float], 2025-05-07T20:32:05.5746869Z contiguous: bool, 2025-05-07T20:32:05.5746956Z compiled: bool, 2025-05-07T20:32:05.5747032Z ) -> None: 2025-05-07T20:32:05.5747131Z torch.manual_seed(2025) 2025-05-07T20:32:05.5747205Z 2025-05-07T20:32:05.5747373Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5749233Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5749242Z 2025-05-07T20:32:05.5749360Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5749364Z 2025-05-07T20:32:05.5749470Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5749692Z self=, 2025-05-07T20:32:05.5749824Z T=4096, 2025-05-07T20:32:05.5749901Z D=7168, 2025-05-07T20:32:05.5749982Z scale_ub=None, 2025-05-07T20:32:05.5750071Z contiguous=True, 2025-05-07T20:32:05.5750154Z compiled=False, 2025-05-07T20:32:05.5750225Z ) 2025-05-07T20:32:05.5750447Z self = 2025-05-07T20:32:05.5750613Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.5750617Z 2025-05-07T20:32:05.5750692Z @given( 2025-05-07T20:32:05.5750811Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5750908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5751069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5751223Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5751334Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5751412Z ) 2025-05-07T20:32:05.5751692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5751787Z def test_silu_mul_quant( 2025-05-07T20:32:05.5751869Z self, 2025-05-07T20:32:05.5751945Z T: int, 2025-05-07T20:32:05.5752020Z D: int, 2025-05-07T20:32:05.5752124Z scale_ub: Optional[float], 2025-05-07T20:32:05.5752212Z contiguous: bool, 2025-05-07T20:32:05.5752297Z compiled: bool, 2025-05-07T20:32:05.5752384Z ) -> None: 2025-05-07T20:32:05.5752481Z torch.manual_seed(2025) 2025-05-07T20:32:05.5752571Z 2025-05-07T20:32:05.5752761Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5754530Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5754548Z 2025-05-07T20:32:05.5754668Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5754672Z 2025-05-07T20:32:05.5754774Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5755000Z self=, 2025-05-07T20:32:05.5755078Z T=16384, 2025-05-07T20:32:05.5755158Z D=7168, 2025-05-07T20:32:05.5755245Z scale_ub=None, 2025-05-07T20:32:05.5755329Z contiguous=True, 2025-05-07T20:32:05.5755416Z compiled=False, 2025-05-07T20:32:05.5755498Z ) 2025-05-07T20:32:05.5755711Z self = 2025-05-07T20:32:05.5755893Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.5755897Z 2025-05-07T20:32:05.5755977Z @given( 2025-05-07T20:32:05.5756093Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5756195Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5756308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5756424Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5756545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5756619Z ) 2025-05-07T20:32:05.5756863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5756966Z def test_silu_mul_quant( 2025-05-07T20:32:05.5757042Z self, 2025-05-07T20:32:05.5757128Z T: int, 2025-05-07T20:32:05.5757205Z D: int, 2025-05-07T20:32:05.5757302Z scale_ub: Optional[float], 2025-05-07T20:32:05.5757402Z contiguous: bool, 2025-05-07T20:32:05.5757486Z compiled: bool, 2025-05-07T20:32:05.5757564Z ) -> None: 2025-05-07T20:32:05.5757713Z torch.manual_seed(2025) 2025-05-07T20:32:05.5757788Z 2025-05-07T20:32:05.5757952Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5759710Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5759789Z 2025-05-07T20:32:05.5759905Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5759910Z 2025-05-07T20:32:05.5760020Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5760278Z self=, 2025-05-07T20:32:05.5760361Z T=16384, 2025-05-07T20:32:05.5760440Z D=7168, 2025-05-07T20:32:05.5760522Z scale_ub=1200.0, 2025-05-07T20:32:05.5760613Z contiguous=True, 2025-05-07T20:32:05.5760697Z compiled=False, 2025-05-07T20:32:05.5760770Z ) 2025-05-07T20:32:05.5760987Z self = 2025-05-07T20:32:05.5761161Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.5761165Z 2025-05-07T20:32:05.5761242Z @given( 2025-05-07T20:32:05.5761362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5761461Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5761589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5761703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5761822Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5761896Z ) 2025-05-07T20:32:05.5762143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5762244Z def test_silu_mul_quant( 2025-05-07T20:32:05.5762319Z self, 2025-05-07T20:32:05.5762395Z T: int, 2025-05-07T20:32:05.5762477Z D: int, 2025-05-07T20:32:05.5762579Z scale_ub: Optional[float], 2025-05-07T20:32:05.5762674Z contiguous: bool, 2025-05-07T20:32:05.5762785Z compiled: bool, 2025-05-07T20:32:05.5762870Z ) -> None: 2025-05-07T20:32:05.5762980Z torch.manual_seed(2025) 2025-05-07T20:32:05.5763062Z 2025-05-07T20:32:05.5763228Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5764999Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5765006Z 2025-05-07T20:32:05.5765122Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5765126Z 2025-05-07T20:32:05.5765236Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5765456Z self=, 2025-05-07T20:32:05.5765534Z T=128, 2025-05-07T20:32:05.5765618Z D=5120, 2025-05-07T20:32:05.5765707Z scale_ub=1200.0, 2025-05-07T20:32:05.5765796Z contiguous=False, 2025-05-07T20:32:05.5765892Z compiled=False, 2025-05-07T20:32:05.5765968Z ) 2025-05-07T20:32:05.5766181Z self = 2025-05-07T20:32:05.5766413Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.5766418Z 2025-05-07T20:32:05.5766497Z @given( 2025-05-07T20:32:05.5766615Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5766712Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5766824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5766945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5767056Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5767130Z ) 2025-05-07T20:32:05.5767381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5767514Z def test_silu_mul_quant( 2025-05-07T20:32:05.5767634Z self, 2025-05-07T20:32:05.5767711Z T: int, 2025-05-07T20:32:05.5767789Z D: int, 2025-05-07T20:32:05.5767891Z scale_ub: Optional[float], 2025-05-07T20:32:05.5767979Z contiguous: bool, 2025-05-07T20:32:05.5768104Z compiled: bool, 2025-05-07T20:32:05.5768194Z ) -> None: 2025-05-07T20:32:05.5768288Z torch.manual_seed(2025) 2025-05-07T20:32:05.5768360Z 2025-05-07T20:32:05.5768530Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5768603Z 2025-05-07T20:32:05.5768696Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5768824Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5768914Z x = x_sign * x_clamp 2025-05-07T20:32:05.5768993Z x0 = x[:, :D] 2025-05-07T20:32:05.5769079Z x1 = x[:, D:] 2025-05-07T20:32:05.5769152Z 2025-05-07T20:32:05.5769240Z if contiguous: 2025-05-07T20:32:05.5769335Z x0 = x0.contiguous() 2025-05-07T20:32:05.5769427Z x1 = x1.contiguous() 2025-05-07T20:32:05.5769504Z 2025-05-07T20:32:05.5769594Z if scale_ub is not None: 2025-05-07T20:32:05.5769701Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5769845Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5769922Z ) 2025-05-07T20:32:05.5769997Z else: 2025-05-07T20:32:05.5770101Z scale_ub_tensor = None 2025-05-07T20:32:05.5770173Z 2025-05-07T20:32:05.5770304Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5770399Z op = silu_mul_quant 2025-05-07T20:32:05.5770485Z if compiled: 2025-05-07T20:32:05.5770591Z op = torch.compile(op) 2025-05-07T20:32:05.5770696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5770767Z 2025-05-07T20:32:05.5770861Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5770866Z 2025-05-07T20:32:05.5770967Z moe/activation_test.py:117: 2025-05-07T20:32:05.5771099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5771204Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5771303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5771806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5771908Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5772264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5772487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5772825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5772919Z kernel = self.compile( 2025-05-07T20:32:05.5773305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5773480Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5773615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5773619Z 2025-05-07T20:32:05.5773892Z self = 2025-05-07T20:32:05.5774670Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5775173Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489817f6a0>} 2025-05-07T20:32:05.5775915Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5776185Z context = 2025-05-07T20:32:05.5776189Z 2025-05-07T20:32:05.5776391Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5776651Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5776766Z module_map=module_map) 2025-05-07T20:32:05.5776926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5777030Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5777111Z E ^ 2025-05-07T20:32:05.5777464Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5777469Z 2025-05-07T20:32:05.5777884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5777897Z 2025-05-07T20:32:05.5778004Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5778227Z self=, 2025-05-07T20:32:05.5778305Z T=2048, 2025-05-07T20:32:05.5778385Z D=7168, 2025-05-07T20:32:05.5778477Z scale_ub=None, 2025-05-07T20:32:05.5778564Z contiguous=False, 2025-05-07T20:32:05.5778648Z compiled=False, 2025-05-07T20:32:05.5778724Z ) 2025-05-07T20:32:05.5778940Z self = 2025-05-07T20:32:05.5779111Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.5779115Z 2025-05-07T20:32:05.5779200Z @given( 2025-05-07T20:32:05.5779316Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5779421Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5779534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5779652Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5779773Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5779847Z ) 2025-05-07T20:32:05.5780096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5780201Z def test_silu_mul_quant( 2025-05-07T20:32:05.5780278Z self, 2025-05-07T20:32:05.5780355Z T: int, 2025-05-07T20:32:05.5780439Z D: int, 2025-05-07T20:32:05.5780537Z scale_ub: Optional[float], 2025-05-07T20:32:05.5780626Z contiguous: bool, 2025-05-07T20:32:05.5780719Z compiled: bool, 2025-05-07T20:32:05.5780797Z ) -> None: 2025-05-07T20:32:05.5780898Z torch.manual_seed(2025) 2025-05-07T20:32:05.5780970Z 2025-05-07T20:32:05.5781135Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5783008Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5783019Z 2025-05-07T20:32:05.5783138Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5783142Z 2025-05-07T20:32:05.5783251Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5783470Z self=, 2025-05-07T20:32:05.5783547Z T=128, 2025-05-07T20:32:05.5783631Z D=7168, 2025-05-07T20:32:05.5783714Z scale_ub=1200.0, 2025-05-07T20:32:05.5783800Z contiguous=True, 2025-05-07T20:32:05.5783891Z compiled=True, 2025-05-07T20:32:05.5784006Z ) 2025-05-07T20:32:05.5784267Z self = 2025-05-07T20:32:05.5784431Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.5784435Z 2025-05-07T20:32:05.5784548Z @given( 2025-05-07T20:32:05.5784672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5784770Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5784884Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5785005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5785115Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5785188Z ) 2025-05-07T20:32:05.5785439Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5785530Z def test_silu_mul_quant( 2025-05-07T20:32:05.5785613Z self, 2025-05-07T20:32:05.5785691Z T: int, 2025-05-07T20:32:05.5785770Z D: int, 2025-05-07T20:32:05.5785877Z scale_ub: Optional[float], 2025-05-07T20:32:05.5785967Z contiguous: bool, 2025-05-07T20:32:05.5786052Z compiled: bool, 2025-05-07T20:32:05.5786135Z ) -> None: 2025-05-07T20:32:05.5786231Z torch.manual_seed(2025) 2025-05-07T20:32:05.5786309Z 2025-05-07T20:32:05.5786479Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5786553Z 2025-05-07T20:32:05.5786646Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5786775Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5786864Z x = x_sign * x_clamp 2025-05-07T20:32:05.5786953Z x0 = x[:, :D] 2025-05-07T20:32:05.5787032Z x1 = x[:, D:] 2025-05-07T20:32:05.5787104Z 2025-05-07T20:32:05.5787192Z if contiguous: 2025-05-07T20:32:05.5787283Z x0 = x0.contiguous() 2025-05-07T20:32:05.5787371Z x1 = x1.contiguous() 2025-05-07T20:32:05.5787455Z 2025-05-07T20:32:05.5787544Z if scale_ub is not None: 2025-05-07T20:32:05.5787653Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5787793Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5787869Z ) 2025-05-07T20:32:05.5787953Z else: 2025-05-07T20:32:05.5788054Z scale_ub_tensor = None 2025-05-07T20:32:05.5788125Z 2025-05-07T20:32:05.5788253Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5788348Z op = silu_mul_quant 2025-05-07T20:32:05.5788432Z if compiled: 2025-05-07T20:32:05.5788538Z op = torch.compile(op) 2025-05-07T20:32:05.5788645Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5788718Z 2025-05-07T20:32:05.5788813Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5788817Z 2025-05-07T20:32:05.5788914Z moe/activation_test.py:117: 2025-05-07T20:32:05.5789041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5789200Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5789300Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5789675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5789816Z return fn(*args, **kwargs) 
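NOTE: Everything from the first OutOfMemoryError down to here is the same failure repeating: the process already holds 22.04 GiB of the GPU's 22.07 GiB, so even the 20-448 MiB torch.randn/torch.clamp allocations at the top of each hypothesis example abort before any kernel runs. A minimal cleanup sketch, assuming the pressure comes from cached allocations surviving across examples (the helper name is hypothetical and would be called at the top of the test body -- hypothesis runs all examples inside a single pytest call, so a per-test fixture would fire too late):

    import gc
    import torch

    def _reclaim_cuda_memory() -> None:
        # Drop dead tensors left over from a previous failed example, then
        # return cached-but-unused allocator blocks to the CUDA driver.
        # Live tensors are unaffected; this only trims the cache.
        gc.collect()
        torch.cuda.empty_cache()

The error text's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, targets fragmentation instead, and would have to be set in the job environment before the Python process starts.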
2025-05-07T20:32:05.5790306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5790413Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5790767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5790987Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5791329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5791425Z kernel = self.compile( 2025-05-07T20:32:05.5791852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5792065Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5792231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5792236Z 2025-05-07T20:32:05.5792443Z self = 2025-05-07T20:32:05.5793267Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5793768Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48983c68e0>} 2025-05-07T20:32:05.5794510Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5794705Z context = 2025-05-07T20:32:05.5794716Z 2025-05-07T20:32:05.5794882Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5795141Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5795255Z module_map=module_map) 2025-05-07T20:32:05.5795415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5795517Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5795603Z E ^ 2025-05-07T20:32:05.5795955Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5795960Z 2025-05-07T20:32:05.5796377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5796385Z 2025-05-07T20:32:05.5796491Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5796713Z self=, 2025-05-07T20:32:05.5796799Z T=128, 2025-05-07T20:32:05.5796876Z D=7168, 2025-05-07T20:32:05.5796958Z scale_ub=1200.0, 2025-05-07T20:32:05.5797048Z contiguous=True, 2025-05-07T20:32:05.5797134Z compiled=False, 2025-05-07T20:32:05.5797206Z ) 2025-05-07T20:32:05.5797423Z self = 2025-05-07T20:32:05.5797594Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.5797599Z 2025-05-07T20:32:05.5797682Z @given( 2025-05-07T20:32:05.5797800Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5797900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5798021Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5798138Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5798250Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5798327Z ) 2025-05-07T20:32:05.5798613Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5798707Z def test_silu_mul_quant( 2025-05-07T20:32:05.5798790Z self, 2025-05-07T20:32:05.5798866Z T: int, 2025-05-07T20:32:05.5798950Z D: int, 2025-05-07T20:32:05.5799047Z scale_ub: Optional[float], 2025-05-07T20:32:05.5799135Z contiguous: bool, 2025-05-07T20:32:05.5799225Z compiled: bool, 2025-05-07T20:32:05.5799302Z ) -> None: 2025-05-07T20:32:05.5799395Z torch.manual_seed(2025) 2025-05-07T20:32:05.5799473Z 2025-05-07T20:32:05.5799638Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5799752Z 2025-05-07T20:32:05.5799852Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5800012Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5801836Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5801843Z 2025-05-07T20:32:05.5801960Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.5801964Z 2025-05-07T20:32:05.5802070Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5802293Z self=, 2025-05-07T20:32:05.5802380Z T=128, 2025-05-07T20:32:05.5802463Z D=5120, 2025-05-07T20:32:05.5802548Z scale_ub=1200.0, 2025-05-07T20:32:05.5802633Z contiguous=True, 2025-05-07T20:32:05.5802721Z compiled=True, 2025-05-07T20:32:05.5802797Z ) 2025-05-07T20:32:05.5803018Z self = 2025-05-07T20:32:05.5803184Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.5803189Z 2025-05-07T20:32:05.5803269Z @given( 2025-05-07T20:32:05.5803392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5803489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5803601Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5803723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5803834Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5803907Z ) 2025-05-07T20:32:05.5804157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5804252Z def test_silu_mul_quant( 2025-05-07T20:32:05.5804330Z self, 2025-05-07T20:32:05.5804412Z T: int, 2025-05-07T20:32:05.5804486Z D: int, 2025-05-07T20:32:05.5804588Z scale_ub: Optional[float], 2025-05-07T20:32:05.5804683Z contiguous: bool, 2025-05-07T20:32:05.5804769Z compiled: bool, 2025-05-07T20:32:05.5804850Z ) -> None: 2025-05-07T20:32:05.5808832Z torch.manual_seed(2025) 2025-05-07T20:32:05.5808917Z 2025-05-07T20:32:05.5809094Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5809167Z 2025-05-07T20:32:05.5809267Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5809390Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5811233Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5811245Z 2025-05-07T20:32:05.5811364Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.5811369Z 2025-05-07T20:32:05.5811473Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5811692Z self=, 2025-05-07T20:32:05.5811767Z T=128, 2025-05-07T20:32:05.5811843Z D=7168, 2025-05-07T20:32:05.5811922Z scale_ub=None, 2025-05-07T20:32:05.5812004Z contiguous=True, 2025-05-07T20:32:05.5812087Z compiled=True, 2025-05-07T20:32:05.5812158Z ) 2025-05-07T20:32:05.5812418Z self = 2025-05-07T20:32:05.5812625Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.5812629Z 2025-05-07T20:32:05.5812708Z @given( 2025-05-07T20:32:05.5812882Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5812981Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5813093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5813210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5813319Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5813391Z ) 2025-05-07T20:32:05.5813636Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5813727Z def test_silu_mul_quant( 2025-05-07T20:32:05.5813802Z self, 2025-05-07T20:32:05.5813879Z T: int, 2025-05-07T20:32:05.5813954Z D: int, 2025-05-07T20:32:05.5814053Z scale_ub: Optional[float], 2025-05-07T20:32:05.5814147Z contiguous: bool, 2025-05-07T20:32:05.5814230Z compiled: bool, 2025-05-07T20:32:05.5814318Z ) -> None: 2025-05-07T20:32:05.5814413Z torch.manual_seed(2025) 2025-05-07T20:32:05.5814482Z 2025-05-07T20:32:05.5814654Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5816431Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5816439Z 2025-05-07T20:32:05.5816558Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5816693Z =============================== warnings summary =============================== 2025-05-07T20:32:05.5817002Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:05.5817304Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:05.5817596Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:05.5818471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:05.5818698Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:05.5818705Z 2025-05-07T20:32:05.5818911Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:05.5819081Z ================= 1 failed, 1 deselected, 3 warnings in 16.14s ================= 2025-05-07T20:32:07.2168638Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:07.2785424Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:07.2786037Z 2025-05-07T20:32:09.2802878Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:11.4264961Z ============================= test session starts ============================== 2025-05-07T20:32:11.4265689Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:11.4266609Z cachedir: .pytest_cache 2025-05-07T20:32:11.4267187Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:11.4267981Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:11.4268391Z plugins: hypothesis-6.131.14 2025-05-07T20:32:13.0204626Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:13.1707283Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:13.1707702Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:13.1707924Z 2025-05-07T20:32:15.5330779Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.5332614Z self=, 2025-05-07T20:32:15.5333531Z T=1, 2025-05-07T20:32:15.5333934Z D=5120, 2025-05-07T20:32:15.5334359Z scale_ub=None, 2025-05-07T20:32:15.5334815Z contiguous=True, 2025-05-07T20:32:15.5335289Z compiled=True, 2025-05-07T20:32:15.5335711Z ) 2025-05-07T20:32:15.5336264Z self = 2025-05-07T20:32:15.5336818Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:15.5337079Z 2025-05-07T20:32:15.5337178Z @given( 2025-05-07T20:32:15.5337419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.5337749Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.5338066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.5338399Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.5338736Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.5339038Z ) 2025-05-07T20:32:15.5339393Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.5339846Z def test_silu_mul_quant( 2025-05-07T20:32:15.5340108Z self, 2025-05-07T20:32:15.5340311Z T: int, 2025-05-07T20:32:15.5340527Z D: int, 2025-05-07T20:32:15.5340760Z scale_ub: Optional[float], 2025-05-07T20:32:15.5341034Z contiguous: bool, 2025-05-07T20:32:15.5341293Z compiled: bool, 2025-05-07T20:32:15.5341534Z ) -> None: 2025-05-07T20:32:15.5341767Z torch.manual_seed(2025) 2025-05-07T20:32:15.5342014Z 2025-05-07T20:32:15.5342302Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.5342656Z 2025-05-07T20:32:15.5342868Z x_sign = torch.sign(x) 2025-05-07T20:32:15.5343169Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:15.5343480Z x = x_sign * x_clamp 2025-05-07T20:32:15.5343735Z x0 = x[:, :D] 2025-05-07T20:32:15.5343963Z x1 = x[:, D:] 2025-05-07T20:32:15.5344173Z 2025-05-07T20:32:15.5344369Z if contiguous: 2025-05-07T20:32:15.5344619Z x0 = x0.contiguous() 2025-05-07T20:32:15.5344882Z x1 = x1.contiguous() 2025-05-07T20:32:15.5345134Z 2025-05-07T20:32:15.5345337Z if scale_ub is not None: 2025-05-07T20:32:15.5345610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.5346261Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.5346583Z ) 2025-05-07T20:32:15.5346779Z else: 2025-05-07T20:32:15.5347004Z scale_ub_tensor = None 2025-05-07T20:32:15.5347261Z 2025-05-07T20:32:15.5347495Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.5347817Z op = silu_mul_quant 2025-05-07T20:32:15.5348076Z if compiled: 2025-05-07T20:32:15.5348329Z op = torch.compile(op) 2025-05-07T20:32:15.5348625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.5348907Z 2025-05-07T20:32:15.5349226Z y_fp8, y_scale = fn() 2025-05-07T20:32:15.5349514Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:15.5349986Z 2025-05-07T20:32:15.5350236Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.5350574Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:15.5350952Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:15.5351279Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:15.5351636Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.5351957Z 2025-05-07T20:32:15.5352168Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:15.5352364Z 2025-05-07T20:32:15.5352478Z moe/activation_test.py:126: 2025-05-07T20:32:15.5352780Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.5353128Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:15.5353458Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.5354244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:15.5355013Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:15.5355564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.5356280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.5356993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:15.5357720Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.5358478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:15.5359227Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.5359953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:15.5360605Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:15.5361218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:15.5361736Z fn() 2025-05-07T20:32:15.5362253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:15.5362842Z self.fn.run( 
2025-05-07T20:32:15.5363318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.5363849Z kernel = self.compile( 2025-05-07T20:32:15.5364398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.5365059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.5365457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.5365698Z 2025-05-07T20:32:15.5365907Z self = 2025-05-07T20:32:15.5367050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.5368439Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd95239260>} 2025-05-07T20:32:15.5369785Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.5370803Z context = 2025-05-07T20:32:15.5371137Z 2025-05-07T20:32:15.5371348Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.5371867Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.5372382Z module_map=module_map) 2025-05-07T20:32:15.5372748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.5373109Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:15.5373382Z E ^ 2025-05-07T20:32:15.5373866Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.5374312Z 2025-05-07T20:32:15.5374726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.5375241Z 2025-05-07T20:32:15.5375352Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.5375779Z self=, 2025-05-07T20:32:15.5376201Z T=2048, 2025-05-07T20:32:15.5376398Z D=5120, 2025-05-07T20:32:15.5376636Z scale_ub=1200.0, 2025-05-07T20:32:15.5376883Z contiguous=True, 2025-05-07T20:32:15.5377109Z compiled=False, 2025-05-07T20:32:15.5377326Z ) 2025-05-07T20:32:16.4623123Z self = 2025-05-07T20:32:16.4623903Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.4624286Z 2025-05-07T20:32:16.4624409Z @given( 2025-05-07T20:32:16.4624721Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.4625101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.4625415Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.4625751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.4626074Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.4626365Z ) 2025-05-07T20:32:16.4626721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.4627165Z def test_silu_mul_quant( 2025-05-07T20:32:16.4627413Z self, 2025-05-07T20:32:16.4627612Z T: int, 2025-05-07T20:32:16.4627811Z D: int, 2025-05-07T20:32:16.4628047Z scale_ub: Optional[float], 2025-05-07T20:32:16.4628594Z contiguous: bool, 2025-05-07T20:32:16.4628837Z compiled: bool, 2025-05-07T20:32:16.4629118Z ) -> None: 2025-05-07T20:32:16.4629341Z torch.manual_seed(2025) 2025-05-07T20:32:16.4629582Z 2025-05-07T20:32:16.4629861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.4630207Z 
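NOTE: This traceback shows that the reference path is just as Triton-bound as the op under test: triton_quantize_fp8_row JIT-compiles _kernel_quantize_fp8_row for fp8e4nv and dies with the identical ValueError, so neither fn() nor ref_fn() can produce a result on this GPU. A reference that stays in plain PyTorch would avoid Triton entirely. The sketch below assumes the kernel's row-wise contract is y ~= y_fp8.float() * scale[:, None], which matches how the test consumes y_scale; the real kernel's handling of scale_ub and all-zero rows may differ:

    import torch

    FP8_DTYPE = torch.float8_e4m3fn        # what Triton calls fp8e4nv
    FP8_MAX = torch.finfo(FP8_DTYPE).max   # 448.0

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Hypothetical pure-PyTorch stand-in for triton_quantize_fp8_row:
        # one scale per row, chosen so the row's max |value| maps to FP8_MAX.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max))
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(FP8_DTYPE)
        return y_fp8, scale

Given that the test clamps |x| to [0.01, 2.0], |y| = |x0| * sigmoid(x0) * |x1| <= 2 * 1 * 2 = 4, so every row scale is at most 4/448 and the values sit comfortably inside fp8e4nv's range.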
2025-05-07T20:32:16.4630402Z x_sign = torch.sign(x) 2025-05-07T20:32:16.4630703Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.4631016Z x = x_sign * x_clamp 2025-05-07T20:32:16.4631264Z x0 = x[:, :D] 2025-05-07T20:32:16.4631481Z x1 = x[:, D:] 2025-05-07T20:32:16.4631697Z 2025-05-07T20:32:16.4631893Z if contiguous: 2025-05-07T20:32:16.4632124Z x0 = x0.contiguous() 2025-05-07T20:32:16.4632385Z x1 = x1.contiguous() 2025-05-07T20:32:16.4632630Z 2025-05-07T20:32:16.4632822Z if scale_ub is not None: 2025-05-07T20:32:16.4633377Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.4633716Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.4634022Z ) 2025-05-07T20:32:16.4634223Z else: 2025-05-07T20:32:16.4634439Z scale_ub_tensor = None 2025-05-07T20:32:16.4634684Z 2025-05-07T20:32:16.4634917Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.4635232Z op = silu_mul_quant 2025-05-07T20:32:16.4635481Z if compiled: 2025-05-07T20:32:16.4635738Z op = torch.compile(op) 2025-05-07T20:32:16.4636036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.4636407Z 2025-05-07T20:32:16.4636599Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.4636880Z 2025-05-07T20:32:16.4636984Z moe/activation_test.py:117: 2025-05-07T20:32:16.4637284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.4637694Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.4637993Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.4638685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.4639370Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.4639906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.4640586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.4641249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.4641778Z kernel = self.compile( 2025-05-07T20:32:16.4642325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.4642982Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.4643386Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.4643614Z 2025-05-07T20:32:16.4643824Z self = 2025-05-07T20:32:16.4644914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.4646319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd94ee4180>} 2025-05-07T20:32:16.4647679Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.4648717Z context = 2025-05-07T20:32:16.4649010Z 2025-05-07T20:32:16.4649179Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.4649703Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.4650179Z module_map=module_map) 2025-05-07T20:32:16.4650541Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.4650897Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.4651164Z E ^ 2025-05-07T20:32:16.4651637Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.4652088Z 2025-05-07T20:32:16.4652508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.4653027Z 2025-05-07T20:32:16.4653135Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.4653604Z self=, 2025-05-07T20:32:16.4654011Z T=2048, 2025-05-07T20:32:16.4654200Z D=5120, 2025-05-07T20:32:16.4654401Z scale_ub=1200.0, 2025-05-07T20:32:16.4654628Z contiguous=True, 2025-05-07T20:32:16.4654848Z compiled=True, 2025-05-07T20:32:16.4655063Z ) 2025-05-07T20:32:16.4655388Z self = 2025-05-07T20:32:16.4655878Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.4656154Z 2025-05-07T20:32:16.4656234Z @given( 2025-05-07T20:32:16.4656470Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.4656825Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.4657207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.4657554Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.4657886Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.4658208Z ) 2025-05-07T20:32:16.4658564Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.4659009Z def test_silu_mul_quant( 2025-05-07T20:32:16.4659247Z self, 2025-05-07T20:32:16.4659447Z T: int, 2025-05-07T20:32:16.4659647Z D: int, 2025-05-07T20:32:16.4659863Z scale_ub: Optional[float], 2025-05-07T20:32:16.4660141Z contiguous: bool, 2025-05-07T20:32:16.4660382Z compiled: bool, 2025-05-07T20:32:16.4660605Z ) -> None: 2025-05-07T20:32:16.4660823Z torch.manual_seed(2025) 2025-05-07T20:32:16.4661072Z 2025-05-07T20:32:16.4661343Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.4661694Z 2025-05-07T20:32:16.4661896Z x_sign = torch.sign(x) 2025-05-07T20:32:16.4662187Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.4662499Z x = x_sign * x_clamp 2025-05-07T20:32:16.4662753Z x0 = x[:, :D] 2025-05-07T20:32:16.4662979Z x1 = x[:, D:] 2025-05-07T20:32:16.4663189Z 2025-05-07T20:32:16.4663378Z if contiguous: 2025-05-07T20:32:16.4663621Z x0 = x0.contiguous() 2025-05-07T20:32:16.4663884Z x1 = x1.contiguous() 2025-05-07T20:32:16.4664128Z 2025-05-07T20:32:16.4664325Z if scale_ub is not None: 2025-05-07T20:32:16.4664596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.4664935Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.4665244Z ) 2025-05-07T20:32:16.4665437Z else: 2025-05-07T20:32:16.4665651Z scale_ub_tensor = None 2025-05-07T20:32:16.4665906Z 2025-05-07T20:32:16.4666137Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.4666460Z op = silu_mul_quant 2025-05-07T20:32:16.4666716Z if compiled: 
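The error above is architecture-dependent, not input-dependent: Triton only lowers the fp8e4nv (torch.float8_e4m3fn) type on GPUs with compute capability 8.9 or newer (Ada/Hopper), while this job's linux.g5.4xlarge runner carries an A10G, which reports compute capability 8.6 and therefore supports only the fp8e4b15 and fp8e5 encodings named in the ValueError. A minimal guard sketch in Python (the helper and marker below are illustrative, not part of moe/activation_test.py):

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers torch.float8_e4m3fn to fp8e4nv, which needs SM 8.9+
        # (e.g. L4, H100); the A10G on this runner is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker that would skip fp8 tests on this runner:
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv needs compute capability >= 8.9",
    )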
Hypothesis goes on to try ten more examples, and every one fails with the same CompilationError: examples with compiled=False die at y_fp8, y_scale = fn() (moe/activation_test.py:117) while compiling _fbgemm_silu_mul_quant, and examples with compiled=True get past fn() and die at y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126) while the Triton autotuner benchmarks _kernel_quantize_fp8_row, each time with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The examples tried, with the duplicated source listings and tracebacks collapsed:

2025-05-07T20:32:16.4653135Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:16.4693654Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:17.2630636Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:17.2671669Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:18.1841871Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:18.1873360Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:18.2353981Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:18.5352539Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:18.5383494Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:18.9872797Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)

The traceback of the final example is still in progress when the captured log ends.
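For orientation, the computation both failing kernels implement is small. A rough eager-PyTorch equivalent of the silu_mul_quant path under test (a sketch only: it assumes torch.float8_e4m3fn is available and approximates FBGEMM's row-wise quantization, which may differ in details such as zero-row and scale_ub handling):

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then per-row scaling into fp8 e4m3.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        scale = row_max / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        # The test reconstructs y as y_fp8.to(torch.float32) * scale[:, None].
        return y_fp8, scale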
2025-05-07T20:32:19.4216896Z op = torch.compile(op) 2025-05-07T20:32:19.4217197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.4217479Z 2025-05-07T20:32:19.4217676Z y_fp8, y_scale = fn() 2025-05-07T20:32:19.4217971Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:19.4218269Z 2025-05-07T20:32:19.4218505Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.4218847Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:19.4219146Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:19.4219468Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:19.4219830Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.4220151Z 2025-05-07T20:32:19.4220361Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:19.4220557Z 2025-05-07T20:32:19.4220660Z moe/activation_test.py:126: 2025-05-07T20:32:19.4221103Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.4221447Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:19.4221774Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.4222572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:19.4223346Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:19.4223902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.4224590Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.4225335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:19.4226152Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:19.4226967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:19.4227772Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:19.4228838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:19.4229551Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:19.4230158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:19.4230691Z fn() 2025-05-07T20:32:19.4231215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:19.4231818Z self.fn.run( 2025-05-07T20:32:19.4232290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.4232836Z kernel = self.compile( 2025-05-07T20:32:19.4233391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.4234054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.4234465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.4234705Z 2025-05-07T20:32:19.4234916Z self = 2025-05-07T20:32:19.4236013Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:19.4237430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8f6c2d40>} 2025-05-07T20:32:19.4238840Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.4239885Z context = 2025-05-07T20:32:19.4240186Z 2025-05-07T20:32:19.4240356Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.4240884Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.4241355Z module_map=module_map) 2025-05-07T20:32:19.4241733Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.4242099Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:19.4242368Z E ^ 2025-05-07T20:32:19.4242847Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.4243311Z 2025-05-07T20:32:19.4243827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.4244347Z 2025-05-07T20:32:19.4244463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.4244876Z self=, 2025-05-07T20:32:19.4245288Z T=128, 2025-05-07T20:32:19.4245487Z D=5120, 2025-05-07T20:32:19.4245685Z scale_ub=None, 2025-05-07T20:32:19.4245913Z contiguous=True, 2025-05-07T20:32:19.4246147Z compiled=True, 2025-05-07T20:32:19.4246361Z ) 2025-05-07T20:32:20.0838283Z self = 2025-05-07T20:32:20.0839336Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:20.0839823Z 2025-05-07T20:32:20.0839928Z @given( 2025-05-07T20:32:20.0840235Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:20.0840730Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:20.0841046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:20.0841386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:20.0841723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:20.0842005Z ) 2025-05-07T20:32:20.0842358Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:20.0842802Z def test_silu_mul_quant( 2025-05-07T20:32:20.0843057Z self, 2025-05-07T20:32:20.0843258Z T: int, 2025-05-07T20:32:20.0843465Z D: int, 2025-05-07T20:32:20.0843696Z scale_ub: Optional[float], 2025-05-07T20:32:20.0843966Z contiguous: bool, 2025-05-07T20:32:20.0844225Z compiled: bool, 2025-05-07T20:32:20.0844465Z ) -> None: 2025-05-07T20:32:20.0844682Z torch.manual_seed(2025) 2025-05-07T20:32:20.0844932Z 2025-05-07T20:32:20.0845216Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:20.0845562Z 2025-05-07T20:32:20.0845769Z x_sign = torch.sign(x) 2025-05-07T20:32:20.0846073Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:20.0846384Z x = x_sign * x_clamp 2025-05-07T20:32:20.0846633Z x0 = x[:, :D] 2025-05-07T20:32:20.0846880Z x1 = x[:, D:] 2025-05-07T20:32:20.0847096Z 2025-05-07T20:32:20.0847284Z if contiguous: 2025-05-07T20:32:20.0847538Z x0 = x0.contiguous() 2025-05-07T20:32:20.0847843Z x1 = x1.contiguous() 2025-05-07T20:32:20.0848089Z 2025-05-07T20:32:20.0848295Z if scale_ub is not None: 2025-05-07T20:32:20.0848580Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:20.0848915Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:20.0849237Z ) 2025-05-07T20:32:20.0849438Z else: 2025-05-07T20:32:20.0849654Z scale_ub_tensor = None 2025-05-07T20:32:20.0849910Z 2025-05-07T20:32:20.0850147Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
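The ref_fn shown above is just SiLU gating (x0 * sigmoid(x0) * x1) followed by row-wise fp8 quantization, so the failing reference path can be sanity-checked without Triton at all. A minimal pure-PyTorch sketch, assuming a float8_e4m3fn target and a row_max / fp8_max scale convention (the helper name rowwise_quantize_fp8_ref is hypothetical; fbgemm_gpu's triton_quantize_fp8_row may clamp or guard differently):

    from typing import Optional, Tuple

    import torch

    def rowwise_quantize_fp8_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Scale each row so its max |value| maps onto the fp8 range,
        # optionally clamping the row max to scale_ub first (mirroring
        # the test's scale_ub_tensor).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    # Mirrors ref_fn: y = x0 * sigmoid(x0) * x1, then quantize per row.
    x0 = torch.randn(4, 16)
    x1 = torch.randn(4, 16)
    y = x0 * torch.sigmoid(x0) * x1
    y_fp8, y_scale = rowwise_quantize_fp8_ref(y)

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None], exactly as the test does, recovers y up to fp8 rounding.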
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
[... same source listing; fails at moe/activation_test.py:126 in ref_fn with the same CompilationError in _kernel_quantize_fp8_row ...]

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[... same source listing; fails at moe/activation_test.py:126 in ref_fn with the same CompilationError in _kernel_quantize_fp8_row ...]

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
W0507 20:32:20.627000 238910 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:32:20.627000 238910 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:32:20.627000 238910 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:32:20.627000 238910 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:32:20.627000 238910 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
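The recompile_limit warning above is expected with this test: each new (T, stride) combination Hypothesis tries guard-misses on the compiled silu_mul_quant until torch._dynamo gives up after 8 recompiles. Two common mitigations, sketched under the assumption that the config knob named in the warning is available in this PyTorch build (the import path is taken from the traceback above; the limit value is illustrative):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Raise the limit named in the warning (default 8) so shape-churning
    # property tests can keep compiling fresh graphs...
    torch._dynamo.config.recompile_limit = 64

    # ...or compile once with dynamic shapes so T and the input strides
    # stay symbolic instead of guard-missing into a new graph per example.
    op = torch.compile(silu_mul_quant, dynamic=True)

Running with TORCH_LOGS="recompiles", as the warning suggests, would print every guard miss; here the x0 stride flips between 5120 (contiguous copy) and 10240 (view into the [T, 2*D] buffer) as the contiguous parameter varies.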
[... T=16384 example: same source listing; fails at moe/activation_test.py:126 in ref_fn with the same CompilationError in _kernel_quantize_fp8_row ...]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

[... same source listing; with scale_ub set, the failure now occurs in fn() itself, i.e. in the fused forward kernel rather than the reference quantization ...]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
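Every failure in this run bottoms out in the same ValueError: Triton's fp8e4nv type is float8_e4m3fn, whose codegen requires compute capability 8.9 or newer (Ada/Hopper), while the A10G on this linux.g5.4xlarge runner is SM 8.6, where Triton only offers fp8e4b15 and fp8e5. A minimal guard sketch for tests like the one above (the helper and skip messages are illustrative, not FBGEMM API):

    import pytest
    import torch

    def require_fp8e4nv() -> None:
        # fp8e4nv / float8_e4m3fn Triton codegen needs SM >= 8.9
        # (e.g. L4, L40S, H100); the A10G in this job reports (8, 6).
        if not torch.cuda.is_available():
            pytest.skip("CUDA device required")
        if torch.cuda.get_device_capability() < (8, 9):
            pytest.skip("fp8e4nv unsupported: only fp8e4b15/fp8e5 on this GPU")

Under such a guard the fp8 examples would be skipped on this runner instead of erroring inside both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant.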
y_scale_ref = ref_fn() 2025-05-07T20:32:21.0176721Z 2025-05-07T20:32:21.0176833Z moe/activation_test.py:126: 2025-05-07T20:32:21.0177220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.0177555Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:21.0177885Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.0178719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:21.0179460Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:21.0180018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.0180705Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.0181443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:21.0182272Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.0183158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:21.0183909Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.0184644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:21.0185277Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:21.0185887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:21.0186410Z fn() 2025-05-07T20:32:21.0186917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:21.0187510Z self.fn.run( 2025-05-07T20:32:21.0187981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.0188519Z kernel = self.compile( 2025-05-07T20:32:21.0189171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.0189824Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.0190229Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.0190460Z 2025-05-07T20:32:21.0190668Z self = 2025-05-07T20:32:21.0191746Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.0193143Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd8e4bde40>} 2025-05-07T20:32:21.0194502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.0195528Z context = 2025-05-07T20:32:21.0195814Z 2025-05-07T20:32:21.0195980Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.0196504Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.0196974Z module_map=module_map) 2025-05-07T20:32:21.0197358Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.0197719Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:21.0197998Z E ^ 2025-05-07T20:32:21.0198469Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.0198971Z 2025-05-07T20:32:21.0199456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.0199970Z 2025-05-07T20:32:21.0200079Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.0200498Z self=, 2025-05-07T20:32:21.0200911Z T=1, 2025-05-07T20:32:21.0201103Z D=5120, 2025-05-07T20:32:21.0201297Z scale_ub=None, 2025-05-07T20:32:21.0201525Z contiguous=True, 2025-05-07T20:32:21.0201759Z compiled=False, 2025-05-07T20:32:21.0201968Z ) 2025-05-07T20:32:21.1353677Z self = 2025-05-07T20:32:21.1354422Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.1355202Z 2025-05-07T20:32:21.1355328Z @given( 2025-05-07T20:32:21.1355628Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.1356048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.1356560Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.1356911Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.1357254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.1357544Z ) 2025-05-07T20:32:21.1357896Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.1358332Z def test_silu_mul_quant( 2025-05-07T20:32:21.1358578Z self, 2025-05-07T20:32:21.1358781Z T: int, 2025-05-07T20:32:21.1358976Z D: int, 2025-05-07T20:32:21.1359205Z scale_ub: Optional[float], 2025-05-07T20:32:21.1359478Z contiguous: bool, 2025-05-07T20:32:21.1359716Z compiled: bool, 2025-05-07T20:32:21.1359961Z ) -> None: 2025-05-07T20:32:21.1360187Z torch.manual_seed(2025) 2025-05-07T20:32:21.1360425Z 2025-05-07T20:32:21.1360709Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.1361055Z 2025-05-07T20:32:21.1361253Z x_sign = torch.sign(x) 2025-05-07T20:32:21.1361551Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.1361867Z x = x_sign * x_clamp 2025-05-07T20:32:21.1362110Z x0 = x[:, :D] 2025-05-07T20:32:21.1362335Z x1 = x[:, D:] 2025-05-07T20:32:21.1362559Z 2025-05-07T20:32:21.1362758Z if contiguous: 2025-05-07T20:32:21.1362993Z x0 = x0.contiguous() 2025-05-07T20:32:21.1363256Z x1 = x1.contiguous() 2025-05-07T20:32:21.1363501Z 2025-05-07T20:32:21.1363693Z if scale_ub is not None: 2025-05-07T20:32:21.1363978Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.1364329Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.1364639Z ) 2025-05-07T20:32:21.1364837Z else: 2025-05-07T20:32:21.1365053Z scale_ub_tensor = None 2025-05-07T20:32:21.1365305Z 2025-05-07T20:32:21.1365542Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.1365861Z op = silu_mul_quant 2025-05-07T20:32:21.1366112Z if compiled: 2025-05-07T20:32:21.1366364Z 
op = torch.compile(op) 2025-05-07T20:32:21.1366664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.1366937Z 2025-05-07T20:32:21.1367137Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.1367312Z 2025-05-07T20:32:21.1367418Z moe/activation_test.py:117: 2025-05-07T20:32:21.1367722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.1368065Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.1368402Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.1369091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.1369781Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.1370318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.1371091Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.1371756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.1372281Z kernel = self.compile( 2025-05-07T20:32:21.1372824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.1373479Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.1373873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.1374106Z 2025-05-07T20:32:21.1374361Z self = 2025-05-07T20:32:21.1375518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.1376903Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8e4bf880>} 2025-05-07T20:32:21.1378243Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.1379259Z context = 2025-05-07T20:32:21.1379552Z 2025-05-07T20:32:21.1379717Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.1380241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.1380712Z module_map=module_map) 2025-05-07T20:32:21.1381076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.1381440Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.1381701Z E ^ 2025-05-07T20:32:21.1382162Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.1382620Z 2025-05-07T20:32:21.1383036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.1383553Z 2025-05-07T20:32:21.1383657Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.1384072Z self=, 2025-05-07T20:32:21.1384476Z T=128, 2025-05-07T20:32:21.1384677Z D=5120, 2025-05-07T20:32:21.1384878Z scale_ub=None, 2025-05-07T20:32:21.1385097Z contiguous=False, 2025-05-07T20:32:21.1385330Z compiled=True, 2025-05-07T20:32:21.1385545Z ) 2025-05-07T20:32:21.1385867Z self = 2025-05-07T20:32:21.1386362Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:21.1386635Z 2025-05-07T20:32:21.1386718Z @given( 2025-05-07T20:32:21.1386955Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.1387268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.1387581Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.1387920Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.1388249Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.1388559Z ) 2025-05-07T20:32:21.1388957Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.1389504Z def test_silu_mul_quant( 2025-05-07T20:32:21.1389756Z self, 2025-05-07T20:32:21.1389959Z T: int, 2025-05-07T20:32:21.1390157Z D: int, 2025-05-07T20:32:21.1390380Z scale_ub: Optional[float], 2025-05-07T20:32:21.1390659Z contiguous: bool, 2025-05-07T20:32:21.1390962Z compiled: bool, 2025-05-07T20:32:21.1391185Z ) -> None: 2025-05-07T20:32:21.1391410Z torch.manual_seed(2025) 2025-05-07T20:32:21.1391653Z 2025-05-07T20:32:21.1391922Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.1392267Z 2025-05-07T20:32:21.1392467Z x_sign = torch.sign(x) 2025-05-07T20:32:21.1392757Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.1393069Z x = x_sign * x_clamp 2025-05-07T20:32:21.1393330Z x0 = x[:, :D] 2025-05-07T20:32:21.1393552Z x1 = x[:, D:] 2025-05-07T20:32:21.1393758Z 2025-05-07T20:32:21.1393947Z if contiguous: 2025-05-07T20:32:21.1394230Z x0 = x0.contiguous() 2025-05-07T20:32:21.1394530Z x1 = x1.contiguous() 2025-05-07T20:32:21.1394777Z 2025-05-07T20:32:21.1394977Z if scale_ub is not None: 2025-05-07T20:32:21.1395249Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.1395627Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.1395942Z ) 2025-05-07T20:32:21.1396137Z else: 2025-05-07T20:32:21.1396348Z scale_ub_tensor = None 2025-05-07T20:32:21.1396602Z 2025-05-07T20:32:21.1396833Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.1397153Z op = silu_mul_quant 2025-05-07T20:32:21.1397410Z if compiled: 2025-05-07T20:32:21.1397657Z op = torch.compile(op) 2025-05-07T20:32:21.1397961Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.1398246Z 2025-05-07T20:32:21.1398470Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.1398665Z 2025-05-07T20:32:21.1398769Z moe/activation_test.py:117: 2025-05-07T20:32:21.1399082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.1399421Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.1399705Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.1400272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:21.1400835Z return fn(*args, **kwargs) 
2025-05-07T20:32:21.1401487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.1402177Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.1402717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.1403399Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.1404057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.1404605Z kernel = self.compile( 2025-05-07T20:32:21.1405163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.1405832Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.1406236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.1406478Z 2025-05-07T20:32:21.1406692Z self = 2025-05-07T20:32:21.1407782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.1409157Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8e499c60>} 2025-05-07T20:32:21.1410551Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.1411583Z context = 2025-05-07T20:32:21.1411880Z 2025-05-07T20:32:21.1412048Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.1412572Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.1413039Z module_map=module_map) 2025-05-07T20:32:21.1413409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.1413766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.1414025Z E ^ 2025-05-07T20:32:21.1414490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.1415064Z 2025-05-07T20:32:21.1415480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.1416025Z 2025-05-07T20:32:21.1416142Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.1416553Z self=, 2025-05-07T20:32:21.1416964Z T=128, 2025-05-07T20:32:21.1417163Z D=7168, 2025-05-07T20:32:21.1417358Z scale_ub=1200.0, 2025-05-07T20:32:21.1417591Z contiguous=False, 2025-05-07T20:32:21.1417824Z compiled=False, 2025-05-07T20:32:21.1418030Z ) 2025-05-07T20:32:21.2287937Z self = 2025-05-07T20:32:21.2288810Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:21.2289109Z 2025-05-07T20:32:21.2289189Z @given( 2025-05-07T20:32:21.2289450Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.2289771Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.2290075Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.2290418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.2290752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.2291031Z ) 2025-05-07T20:32:21.2291387Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.2298837Z def test_silu_mul_quant( 2025-05-07T20:32:21.2299163Z self, 2025-05-07T20:32:21.2299380Z T: int, 2025-05-07T20:32:21.2299580Z D: int, 2025-05-07T20:32:21.2299814Z scale_ub: Optional[float], 2025-05-07T20:32:21.2300104Z contiguous: bool, 2025-05-07T20:32:21.2300349Z compiled: bool, 2025-05-07T20:32:21.2300587Z ) -> None: 2025-05-07T20:32:21.2300813Z torch.manual_seed(2025) 2025-05-07T20:32:21.2301064Z 2025-05-07T20:32:21.2301352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.2301711Z 2025-05-07T20:32:21.2301908Z x_sign = torch.sign(x) 2025-05-07T20:32:21.2302212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.2302527Z x = x_sign * x_clamp 2025-05-07T20:32:21.2302775Z x0 = x[:, :D] 2025-05-07T20:32:21.2302995Z x1 = x[:, D:] 2025-05-07T20:32:21.2303212Z 2025-05-07T20:32:21.2303410Z if contiguous: 2025-05-07T20:32:21.2303642Z x0 = x0.contiguous() 2025-05-07T20:32:21.2303897Z x1 = x1.contiguous() 2025-05-07T20:32:21.2304141Z 2025-05-07T20:32:21.2304332Z if scale_ub is not None: 2025-05-07T20:32:21.2304614Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.2304954Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.2305266Z ) 2025-05-07T20:32:21.2305471Z else: 2025-05-07T20:32:21.2305693Z scale_ub_tensor = None 2025-05-07T20:32:21.2305948Z 2025-05-07T20:32:21.2306189Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.2306509Z op = silu_mul_quant 2025-05-07T20:32:21.2306763Z if compiled: 2025-05-07T20:32:21.2307295Z op = torch.compile(op) 2025-05-07T20:32:21.2307604Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.2307885Z 2025-05-07T20:32:21.2308079Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.2308254Z 2025-05-07T20:32:21.2308358Z moe/activation_test.py:117: 2025-05-07T20:32:21.2308671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.2309002Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.2309390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.2310095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.2310886Z 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7efd8ea9c360>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(same Triton compilation traceback as above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
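All of the failures above share one root cause: Triton only exposes the fp8e4nv type (the layout behind torch.float8_e4m3fn) on GPUs with compute capability 8.9 or newer (Ada, Hopper); a device whose supported set is only ('fp8e4b15', 'fp8e5') is an older part such as an sm_86 A10G. A minimal sketch of a capability guard that would skip these cases up front; the helper name and the class-level placement are assumptions for illustration, not the suite's actual structure:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv / torch.float8_e4m3fn kernels need SM 8.9+ (Ada, Hopper);
        # sm_86-class GPUs such as the A10G do not qualify.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "FP8 e4m3 requires compute capability >= 8.9")
    class ActivationTests(unittest.TestCase):
        ...

Gating on the (major, minor) capability tuple rather than on a device-name string keeps the skip correct for both Ada (sm_89) and Hopper (sm_90) while excluding Ampere.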
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
(same test source and Triton compilation traceback as above)
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
(same failure; with compiled=True the call additionally passes through
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
before reaching activation.py:80)
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
(same failure)
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
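To isolate the failure from Hypothesis and the FBGEMM wrappers, a minimal, self-contained sketch (a hypothetical kernel using only public Triton APIs, not FBGEMM code) that should hit the same ast_to_ttir-time ValueError on a pre-sm_89 GPU:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        # Merely referencing the fp8e4nv type in the kernel body should make
        # Triton's AST-to-TTIR lowering raise on unsupported architectures.
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError
    # wrapping the same ValueError seen in the log.
    _cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)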
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

For this example fn() itself succeeded, and the failure moved into the reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, ..., backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7efca3aa1440>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
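The reference path fails identically because triton_quantize_fp8_row is itself a Triton kernel. Elementwise float8 casts, unlike fp8 Triton kernels, do work on older GPUs, so a plain-PyTorch row-wise quantization could stand in as the reference on such hardware. A sketch under that assumption; this is not FBGEMM's implementation, and the eps and clamping choices are illustrative:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # Per-row max-abs scaling into the e4m3fn range; returns fp8 values
        # and the per-row dequantization scales.
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = FP8_MAX / torch.clamp(row_max, min=1e-12)
        y_fp8 = (y.to(torch.float32) * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, (1.0 / scale).squeeze(-1)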
2025-05-07T20:32:21.8306534Z op = torch.compile(op) 2025-05-07T20:32:21.8306834Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.8307110Z 2025-05-07T20:32:21.8307300Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.8307471Z 2025-05-07T20:32:21.8307576Z moe/activation_test.py:117: 2025-05-07T20:32:21.8307876Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.8308230Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.8308529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.8309263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:21.8309925Z return fn(*args, **kwargs) 2025-05-07T20:32:21.8310653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.8311333Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.8311873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.8312562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.8313223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.8313750Z kernel = self.compile( 2025-05-07T20:32:21.8314295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.8314957Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.8315359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.8315590Z 2025-05-07T20:32:21.8315803Z self = 2025-05-07T20:32:21.8316884Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.8318266Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3aa2a20>} 2025-05-07T20:32:21.8319610Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.8320629Z context = 2025-05-07T20:32:21.8320922Z 2025-05-07T20:32:21.8321088Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.8321613Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.8322084Z module_map=module_map) 2025-05-07T20:32:21.8322449Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.8322803Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.8323066Z E ^ 2025-05-07T20:32:21.8323530Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.8323983Z 2025-05-07T20:32:21.8324396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.8324917Z 2025-05-07T20:32:21.8325026Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.8325443Z self=, 2025-05-07T20:32:21.8325845Z T=1, 2025-05-07T20:32:21.8326041Z D=5120, 2025-05-07T20:32:21.8326250Z scale_ub=1200.0, 2025-05-07T20:32:21.8326557Z contiguous=False, 2025-05-07T20:32:21.8326795Z compiled=False, 2025-05-07T20:32:21.8327008Z ) 2025-05-07T20:32:21.8327324Z self = 2025-05-07T20:32:21.8327813Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:21.8328086Z 2025-05-07T20:32:21.8328432Z @given( 2025-05-07T20:32:21.8328688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.8328997Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.8329304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.8329635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.8330024Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.8330368Z ) 2025-05-07T20:32:21.8330723Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.8331230Z def test_silu_mul_quant( 2025-05-07T20:32:21.8331479Z self, 2025-05-07T20:32:21.8331683Z T: int, 2025-05-07T20:32:21.8331883Z D: int, 2025-05-07T20:32:21.8332106Z scale_ub: Optional[float], 2025-05-07T20:32:21.8332383Z contiguous: bool, 2025-05-07T20:32:21.8332632Z compiled: bool, 2025-05-07T20:32:21.8332853Z ) -> None: 2025-05-07T20:32:21.8333074Z torch.manual_seed(2025) 2025-05-07T20:32:21.8333323Z 2025-05-07T20:32:21.8333598Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.8333942Z 2025-05-07T20:32:21.8334145Z x_sign = torch.sign(x) 2025-05-07T20:32:21.8334433Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.8334748Z x = x_sign * x_clamp 2025-05-07T20:32:21.8334994Z x0 = x[:, :D] 2025-05-07T20:32:21.8335210Z x1 = x[:, D:] 2025-05-07T20:32:21.8335423Z 2025-05-07T20:32:21.8335612Z if contiguous: 2025-05-07T20:32:21.8335843Z x0 = x0.contiguous() 2025-05-07T20:32:21.8336110Z x1 = x1.contiguous() 2025-05-07T20:32:21.8336351Z 2025-05-07T20:32:21.8336540Z if scale_ub is not None: 2025-05-07T20:32:21.8336815Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.8337151Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.8337465Z ) 2025-05-07T20:32:21.8337654Z else: 2025-05-07T20:32:21.8337867Z scale_ub_tensor = None 2025-05-07T20:32:21.8338122Z 2025-05-07T20:32:21.8338352Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.8338697Z op = silu_mul_quant 2025-05-07T20:32:21.8338977Z if compiled: 2025-05-07T20:32:21.8339226Z op = torch.compile(op) 2025-05-07T20:32:21.8339528Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.8339807Z 2025-05-07T20:32:21.8340000Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.8340173Z 2025-05-07T20:32:21.8340276Z moe/activation_test.py:117: 2025-05-07T20:32:21.8340577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.8340904Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.8341188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.8341873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.8342568Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.8343099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.8343782Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.8344447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.8344984Z kernel = self.compile( 2025-05-07T20:32:21.8345588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.8346247Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.8346654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.8346881Z 2025-05-07T20:32:21.8347088Z self = 2025-05-07T20:32:21.8348168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.8349627Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3aa31a0>} 2025-05-07T20:32:21.8351092Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.8352118Z context = 2025-05-07T20:32:21.8352406Z 2025-05-07T20:32:21.8352575Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.8353102Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.8353577Z module_map=module_map) 2025-05-07T20:32:21.8353953Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.8354308Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.8354579Z E ^ 2025-05-07T20:32:21.8355059Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.8355507Z 2025-05-07T20:32:21.8355924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.8356452Z 2025-05-07T20:32:21.8356560Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.8356979Z self=, 2025-05-07T20:32:21.8357390Z T=16384, 2025-05-07T20:32:21.8357592Z D=5120, 2025-05-07T20:32:21.8357801Z scale_ub=1200.0, 2025-05-07T20:32:21.8358041Z contiguous=False, 2025-05-07T20:32:21.8358300Z compiled=True, 2025-05-07T20:32:21.8358540Z ) 2025-05-07T20:32:22.0642590Z self = 2025-05-07T20:32:22.0643289Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0643679Z 2025-05-07T20:32:22.0643792Z @given( 2025-05-07T20:32:22.0644112Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0644431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0644733Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0645080Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0645409Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0645691Z ) 2025-05-07T20:32:22.0646044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0646484Z def test_silu_mul_quant( 2025-05-07T20:32:22.0646729Z self, 2025-05-07T20:32:22.0646933Z T: int, 2025-05-07T20:32:22.0647135Z D: int, 2025-05-07T20:32:22.0647353Z scale_ub: Optional[float], 2025-05-07T20:32:22.0647631Z contiguous: bool, 2025-05-07T20:32:22.0647873Z compiled: bool, 2025-05-07T20:32:22.0648138Z ) -> None: 2025-05-07T20:32:22.0648365Z torch.manual_seed(2025) 2025-05-07T20:32:22.0648620Z 2025-05-07T20:32:22.0648934Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0649284Z 2025-05-07T20:32:22.0649478Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0650061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0650376Z x = x_sign * x_clamp 2025-05-07T20:32:22.0650627Z x0 = x[:, :D] 2025-05-07T20:32:22.0650848Z x1 = x[:, D:] 2025-05-07T20:32:22.0651066Z 2025-05-07T20:32:22.0651262Z if contiguous: 2025-05-07T20:32:22.0651500Z x0 = x0.contiguous() 2025-05-07T20:32:22.0651764Z x1 = x1.contiguous() 2025-05-07T20:32:22.0652015Z 2025-05-07T20:32:22.0652210Z if scale_ub is not None: 2025-05-07T20:32:22.0652487Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0652827Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0653136Z ) 2025-05-07T20:32:22.0653420Z else: 2025-05-07T20:32:22.0653710Z scale_ub_tensor = None 2025-05-07T20:32:22.0653957Z 2025-05-07T20:32:22.0654194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0654516Z op = silu_mul_quant 2025-05-07T20:32:22.0654835Z if compiled: 2025-05-07T20:32:22.0655093Z op = torch.compile(op) 2025-05-07T20:32:22.0655394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0655677Z 2025-05-07T20:32:22.0655867Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0656038Z 2025-05-07T20:32:22.0656140Z moe/activation_test.py:117: 2025-05-07T20:32:22.0656433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0656760Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0657047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0657607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0658159Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0658815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0659502Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0660035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0660710Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0661368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0661902Z kernel = self.compile( 2025-05-07T20:32:22.0662441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0663086Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0663496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0663732Z 2025-05-07T20:32:22.0663949Z self = 2025-05-07T20:32:22.0665025Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0666404Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3308ea0>} 2025-05-07T20:32:22.0667742Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0668810Z context = 2025-05-07T20:32:22.0669173Z 2025-05-07T20:32:22.0669347Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0669857Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0670371Z module_map=module_map) 2025-05-07T20:32:22.0670741Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0671089Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0671358Z E ^ 2025-05-07T20:32:22.0671822Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0679062Z 2025-05-07T20:32:22.0679525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0680061Z 2025-05-07T20:32:22.0680169Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0680596Z self=, 2025-05-07T20:32:22.0681176Z T=2048, 2025-05-07T20:32:22.0681380Z D=7168, 2025-05-07T20:32:22.0681587Z scale_ub=1200.0, 2025-05-07T20:32:22.0681826Z contiguous=False, 2025-05-07T20:32:22.0682077Z compiled=True, 2025-05-07T20:32:22.0682381Z ) 2025-05-07T20:32:22.0682703Z self = 2025-05-07T20:32:22.0683204Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0683478Z 2025-05-07T20:32:22.0683563Z @given( 2025-05-07T20:32:22.0683795Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0684114Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0684429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0684764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0685092Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0685381Z ) 2025-05-07T20:32:22.0685743Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0686194Z def test_silu_mul_quant( 2025-05-07T20:32:22.0686446Z self, 2025-05-07T20:32:22.0686649Z T: int, 2025-05-07T20:32:22.0686852Z D: int, 2025-05-07T20:32:22.0687082Z scale_ub: Optional[float], 2025-05-07T20:32:22.0687362Z contiguous: bool, 2025-05-07T20:32:22.0687603Z compiled: bool, 2025-05-07T20:32:22.0687833Z ) -> None: 2025-05-07T20:32:22.0688054Z torch.manual_seed(2025) 2025-05-07T20:32:22.0688292Z 2025-05-07T20:32:22.0688574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0688924Z 2025-05-07T20:32:22.0689131Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0689423Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0689739Z x = x_sign * x_clamp 2025-05-07T20:32:22.0689991Z x0 = x[:, :D] 2025-05-07T20:32:22.0690214Z x1 = x[:, D:] 2025-05-07T20:32:22.0690430Z 2025-05-07T20:32:22.0690624Z if contiguous: 2025-05-07T20:32:22.0690856Z x0 = x0.contiguous() 2025-05-07T20:32:22.0691122Z x1 = x1.contiguous() 2025-05-07T20:32:22.0691365Z 2025-05-07T20:32:22.0691557Z if scale_ub is not None: 2025-05-07T20:32:22.0691839Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0692181Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0692489Z ) 2025-05-07T20:32:22.0692690Z else: 2025-05-07T20:32:22.0692913Z scale_ub_tensor = None 2025-05-07T20:32:22.0693165Z 2025-05-07T20:32:22.0693402Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0693722Z op = silu_mul_quant 2025-05-07T20:32:22.0693979Z if compiled: 2025-05-07T20:32:22.0694230Z op = torch.compile(op) 2025-05-07T20:32:22.0694532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0694817Z 2025-05-07T20:32:22.0695010Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0695181Z 2025-05-07T20:32:22.0695283Z moe/activation_test.py:117: 2025-05-07T20:32:22.0695591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0695970Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0696259Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0696824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0697392Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0698057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0698767Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0699350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0700077Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0700788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0701329Z kernel = self.compile( 2025-05-07T20:32:22.0701929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0702589Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0702999Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0703232Z 2025-05-07T20:32:22.0703453Z self = 2025-05-07T20:32:22.0704552Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0705944Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca33099e0>} 2025-05-07T20:32:22.0707317Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0708350Z context = 2025-05-07T20:32:22.0708640Z 2025-05-07T20:32:22.0708821Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0709483Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0709957Z module_map=module_map) 2025-05-07T20:32:22.0710329Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0710693Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0710960Z E ^ 2025-05-07T20:32:22.0711440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0711897Z 2025-05-07T20:32:22.0712335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0712853Z 2025-05-07T20:32:22.1594370Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.1594834Z self=, 2025-05-07T20:32:22.1595414Z T=1, 2025-05-07T20:32:22.1595680Z D=5120, 2025-05-07T20:32:22.1595939Z scale_ub=None, 2025-05-07T20:32:22.1596249Z contiguous=False, 2025-05-07T20:32:22.1596551Z compiled=False, 2025-05-07T20:32:22.1596820Z ) 2025-05-07T20:32:22.1597247Z self = 2025-05-07T20:32:22.1597821Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.1598104Z 2025-05-07T20:32:22.1598197Z @given( 2025-05-07T20:32:22.1598431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.1598748Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.1599353Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.1599686Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.1600023Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.1600318Z ) 2025-05-07T20:32:22.1600667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.1601112Z def test_silu_mul_quant( 2025-05-07T20:32:22.1601360Z self, 2025-05-07T20:32:22.1601557Z T: int, 2025-05-07T20:32:22.1601768Z D: int, 2025-05-07T20:32:22.1602000Z scale_ub: Optional[float], 2025-05-07T20:32:22.1602272Z contiguous: bool, 2025-05-07T20:32:22.1602565Z compiled: bool, 2025-05-07T20:32:22.1602876Z ) -> None: 2025-05-07T20:32:22.1603182Z torch.manual_seed(2025) 2025-05-07T20:32:22.1603431Z 2025-05-07T20:32:22.1603705Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.1604061Z 2025-05-07T20:32:22.1604337Z x_sign = torch.sign(x) 2025-05-07T20:32:22.1604646Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.1604955Z x = x_sign * x_clamp 2025-05-07T20:32:22.1605203Z x0 = x[:, :D] 2025-05-07T20:32:22.1605431Z x1 = x[:, D:] 2025-05-07T20:32:22.1605635Z 2025-05-07T20:32:22.1605830Z if contiguous: 2025-05-07T20:32:22.1606073Z x0 = x0.contiguous() 2025-05-07T20:32:22.1606332Z x1 = x1.contiguous() 2025-05-07T20:32:22.1606584Z 2025-05-07T20:32:22.1606789Z if scale_ub is not None: 2025-05-07T20:32:22.1607066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.1607419Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.1607739Z ) 2025-05-07T20:32:22.1607935Z else: 2025-05-07T20:32:22.1608152Z scale_ub_tensor = None 2025-05-07T20:32:22.1608413Z 2025-05-07T20:32:22.1608643Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.1608966Z op = silu_mul_quant 2025-05-07T20:32:22.1609221Z if compiled: 2025-05-07T20:32:22.1609477Z op = torch.compile(op) 2025-05-07T20:32:22.1609774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.1610049Z 2025-05-07T20:32:22.1610247Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.1610415Z 2025-05-07T20:32:22.1610519Z moe/activation_test.py:117: 2025-05-07T20:32:22.1610822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.1611157Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.1611436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.1612127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.1612828Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:22.1613369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:22.1614051Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:22.1614713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:22.1615244Z     kernel = self.compile(
2025-05-07T20:32:22.1615784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:22.1616443Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:22.1616847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2025-05-07T20:32:22.1617292Z self =
2025-05-07T20:32:22.1618431Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:22.1619841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca330ad40>}
2025-05-07T20:32:22.1621183Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:22.1622201Z context =

2025-05-07T20:32:22.1622657Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:22.1623222Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:22.1623731Z                           module_map=module_map)
2025-05-07T20:32:22.1624136Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.1624488Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.1624758Z E       ^
2025-05-07T20:32:22.1625234Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:22.1626103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:22.1626719Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:22.1627136Z     self=,
2025-05-07T20:32:22.1627541Z     T=4096,
2025-05-07T20:32:22.1627733Z     D=7168,
2025-05-07T20:32:22.1627934Z     scale_ub=1200.0,
2025-05-07T20:32:22.1628472Z     contiguous=False,
2025-05-07T20:32:22.1628701Z     compiled=False,
2025-05-07T20:32:22.1628909Z )
2025-05-07T20:32:22.1629320Z self =
2025-05-07T20:32:22.1629830Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

2025-05-07T20:32:22.1630186Z     @given(
2025-05-07T20:32:22.1630419Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:22.1630732Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:22.1631035Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:22.1631366Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:22.1631699Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:22.1631979Z     )
2025-05-07T20:32:22.1632325Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:22.1632766Z     def test_silu_mul_quant(
2025-05-07T20:32:22.1633028Z         self,
2025-05-07T20:32:22.1633221Z         T: int,
2025-05-07T20:32:22.1633427Z         D: int,
2025-05-07T20:32:22.1633649Z         scale_ub: Optional[float],
2025-05-07T20:32:22.1633926Z         contiguous: bool,
2025-05-07T20:32:22.1634165Z         compiled: bool,
2025-05-07T20:32:22.1634391Z     ) -> None:
2025-05-07T20:32:22.1634613Z         torch.manual_seed(2025)

2025-05-07T20:32:22.1635128Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

2025-05-07T20:32:22.1635665Z         x_sign = torch.sign(x)
2025-05-07T20:32:22.1635959Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:22.1636278Z         x = x_sign * x_clamp
2025-05-07T20:32:22.1636515Z         x0 = x[:, :D]
2025-05-07T20:32:22.1636738Z         x1 = x[:, D:]

2025-05-07T20:32:22.1637139Z         if contiguous:
2025-05-07T20:32:22.1637376Z             x0 = x0.contiguous()
2025-05-07T20:32:22.1637644Z             x1 = x1.contiguous()

2025-05-07T20:32:22.1638073Z         if scale_ub is not None:
2025-05-07T20:32:22.1638345Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:22.1638758Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:22.1639070Z             )
2025-05-07T20:32:22.1639266Z         else:
2025-05-07T20:32:22.1639480Z             scale_ub_tensor = None

2025-05-07T20:32:22.1639968Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:22.1640285Z             op = silu_mul_quant
2025-05-07T20:32:22.1640534Z             if compiled:
2025-05-07T20:32:22.1640785Z                 op = torch.compile(op)
2025-05-07T20:32:22.1641133Z             return op(x0, x1, scale_ub_tensor)

2025-05-07T20:32:22.1641809Z >       y_fp8, y_scale = fn()

2025-05-07T20:32:22.1642279Z moe/activation_test.py:117:
2025-05-07T20:32:22.1642768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:22.1643105Z moe/activation_test.py:115: in fn
2025-05-07T20:32:22.1643394Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:22.1644147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:22.1644830Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:22.1645365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:22.1646055Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:22.1646709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:22.1647241Z     kernel = self.compile(
2025-05-07T20:32:22.1647784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:22.1648487Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:22.1648905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2025-05-07T20:32:22.1649351Z self =
2025-05-07T20:32:22.1650435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:22.1651800Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca330ba60>}
2025-05-07T20:32:22.1653131Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:22.1654165Z context =

2025-05-07T20:32:22.1654635Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:22.1655162Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:22.1655626Z                           module_map=module_map)
2025-05-07T20:32:22.1655997Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.1656361Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.1656629Z E       ^
2025-05-07T20:32:22.1657089Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:22.1657968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
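Every Hypothesis example fails at the same spot: Triton refuses to lower the kernel because this GPU cannot represent the fp8e4nv (FP8 E4M3) dtype. A minimal diagnostic sketch, assuming the usual mapping of fp8e4nv to torch.float8_e4m3fn and Triton's compute-capability >= 8.9 requirement for it (the A10G in a g5.4xlarge runner reports (8, 6)):

    # Diagnostic sketch (not part of the test suite). Assumes Triton enables
    # fp8e4nv (FP8 E4M3, i.e. torch.float8_e4m3fn) only on GPUs with compute
    # capability >= 8.9; the A10G here reports (8, 6), which would explain the
    # CompilationError above.
    import torch

    def supports_fp8e4nv() -> bool:
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    if __name__ == "__main__":
        print(torch.cuda.get_device_name(), torch.cuda.get_device_capability())
        print("fp8e4nv expected to compile:", supports_fp8e4nv())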
2025-05-07T20:32:22.1658595Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:22.3053229Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.3054711Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:22.3092354Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.4226290Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:22.4257277Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.4258758Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:22.4289480Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.5165444Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:22.5195705Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
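Because the failure is architecture-dependent rather than input-dependent, every example Hypothesis tries fails identically. One way to avoid burning CI time on such runners would be to skip the FP8 path up front; a sketch, where the capability check and decorator placement are illustrative and not the actual FBGEMM test code:

    # Hypothetical guard, not FBGEMM's actual code: skip the property-based
    # test wholesale when the GPU cannot compile fp8e4nv kernels.
    import unittest
    import torch

    def _fp8e4nv_unsupported() -> bool:
        return not torch.cuda.is_available() or torch.cuda.get_device_capability() < (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipIf(_fp8e4nv_unsupported(), "fp8e4nv needs compute capability >= 8.9")
        def test_silu_mul_quant(self) -> None:
            ...  # @given/@settings body as shown in the log above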
2025-05-07T20:32:22.5197259Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:22.5226871Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.5228648Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:22.8736416Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.8738085Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:22.8767637Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
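The error message itself names the only FP8 formats this architecture accepts, fp8e4b15 and fp8e5. Where only the storage format matters, a wrapper could fall back to E5M2 on older parts; a sketch with a hypothetical helper (silu_mul_quant appears to target E4M3 here, so this is illustrative, not a drop-in fix):

    # Hypothetical dtype selection, not an FBGEMM API: prefer E4M3 (fp8e4nv)
    # where supported, otherwise E5M2 (fp8e5), which the message above says
    # this sm_86 part accepts.
    import torch

    def pick_fp8_dtype() -> torch.dtype:
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2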
2025-05-07T20:32:22.8769143Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:22.9485021Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.9486513Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:22.9517191Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.0777137Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:23.0819222Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:23.0819603Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:23.0819866Z E       ^
2025-05-07T20:32:23.0820329Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.0820784Z 2025-05-07T20:32:23.0821208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.0821720Z 2025-05-07T20:32:23.0821835Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.0822248Z self=, 2025-05-07T20:32:23.0822662Z T=16384, 2025-05-07T20:32:23.0822869Z D=5120, 2025-05-07T20:32:23.0823074Z scale_ub=1200.0, 2025-05-07T20:32:23.0823304Z contiguous=True, 2025-05-07T20:32:23.0823532Z compiled=True, 2025-05-07T20:32:23.0823747Z ) 2025-05-07T20:32:23.0824068Z self = 2025-05-07T20:32:23.0824572Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:23.0824854Z 2025-05-07T20:32:23.0824942Z @given( 2025-05-07T20:32:23.0825173Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.0825496Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.0825808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.0826140Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.0826476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.0826780Z ) 2025-05-07T20:32:23.0827133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.0827577Z def test_silu_mul_quant( 2025-05-07T20:32:23.0827827Z self, 2025-05-07T20:32:23.0828086Z T: int, 2025-05-07T20:32:23.0828583Z D: int, 2025-05-07T20:32:23.0828808Z scale_ub: Optional[float], 2025-05-07T20:32:23.0829128Z contiguous: bool, 2025-05-07T20:32:23.0829375Z compiled: bool, 2025-05-07T20:32:23.0829602Z ) -> None: 2025-05-07T20:32:23.0829823Z torch.manual_seed(2025) 2025-05-07T20:32:23.0830063Z 2025-05-07T20:32:23.0830344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.0830689Z 2025-05-07T20:32:23.0830883Z x_sign = torch.sign(x) 2025-05-07T20:32:23.0831188Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.0831579Z x = x_sign * x_clamp 2025-05-07T20:32:23.0831885Z x0 = x[:, :D] 2025-05-07T20:32:23.0832100Z x1 = x[:, D:] 2025-05-07T20:32:23.0832312Z 2025-05-07T20:32:23.0832501Z if contiguous: 2025-05-07T20:32:23.0832731Z x0 = x0.contiguous() 2025-05-07T20:32:23.0833056Z x1 = x1.contiguous() 2025-05-07T20:32:23.0833301Z 2025-05-07T20:32:23.0833493Z if scale_ub is not None: 2025-05-07T20:32:23.0833768Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.0834108Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.0834413Z ) 2025-05-07T20:32:23.0834616Z else: 2025-05-07T20:32:23.0834832Z scale_ub_tensor = None 2025-05-07T20:32:23.0835081Z 2025-05-07T20:32:23.0835322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.0835648Z op = silu_mul_quant 2025-05-07T20:32:23.0835901Z if compiled: 2025-05-07T20:32:23.0836159Z op = torch.compile(op) 2025-05-07T20:32:23.0836471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0836742Z 2025-05-07T20:32:23.0836941Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.0837112Z 2025-05-07T20:32:23.0837215Z moe/activation_test.py:117: 2025-05-07T20:32:23.0837525Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0837862Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.0838149Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0838745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.0839331Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.0840000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.0840710Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.0841260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.0841956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.0842638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.0843186Z kernel = self.compile( 2025-05-07T20:32:23.0843733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.0844403Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.0844820Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0845053Z 2025-05-07T20:32:23.0845272Z self = 2025-05-07T20:32:23.0846362Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.0847757Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca35d11c0>} 2025-05-07T20:32:23.0849193Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.0850291Z context = 2025-05-07T20:32:23.0850587Z 2025-05-07T20:32:23.0850765Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.0851287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.0851768Z module_map=module_map) 2025-05-07T20:32:23.0852183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.0852579Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.0852844Z E ^ 2025-05-07T20:32:23.0853361Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.0853824Z 2025-05-07T20:32:23.0854245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.0854773Z 2025-05-07T20:32:23.3813136Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.3814039Z self=, 2025-05-07T20:32:23.3814852Z T=16384, 2025-05-07T20:32:23.3815241Z D=5120, 2025-05-07T20:32:23.3815634Z scale_ub=None, 2025-05-07T20:32:23.3816074Z contiguous=False, 2025-05-07T20:32:23.3816519Z compiled=True, 2025-05-07T20:32:23.3816936Z ) 2025-05-07T20:32:23.3817586Z self = 2025-05-07T20:32:23.3818599Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:23.3819151Z 2025-05-07T20:32:23.3819235Z @given( 2025-05-07T20:32:23.3819519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.3819858Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.3820167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.3820505Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.3820837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.3821127Z ) 2025-05-07T20:32:23.3821487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.3821930Z def test_silu_mul_quant( 2025-05-07T20:32:23.3822174Z self, 2025-05-07T20:32:23.3822379Z T: int, 2025-05-07T20:32:23.3822583Z D: int, 2025-05-07T20:32:23.3822806Z scale_ub: Optional[float], 2025-05-07T20:32:23.3823087Z contiguous: bool, 2025-05-07T20:32:23.3823336Z compiled: bool, 2025-05-07T20:32:23.3823564Z ) -> None: 2025-05-07T20:32:23.3823785Z torch.manual_seed(2025) 2025-05-07T20:32:23.3824029Z 2025-05-07T20:32:23.3824303Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.3824653Z 2025-05-07T20:32:23.3824853Z x_sign = torch.sign(x) 2025-05-07T20:32:23.3825144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.3825450Z x = x_sign * x_clamp 2025-05-07T20:32:23.3825694Z x0 = x[:, :D] 2025-05-07T20:32:23.3825919Z x1 = x[:, D:] 2025-05-07T20:32:23.3826125Z 2025-05-07T20:32:23.3826319Z if contiguous: 2025-05-07T20:32:23.3826558Z x0 = x0.contiguous() 2025-05-07T20:32:23.3826814Z x1 = x1.contiguous() 2025-05-07T20:32:23.3827054Z 2025-05-07T20:32:23.3827258Z if scale_ub is not None: 2025-05-07T20:32:23.3827528Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.3827873Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.3828455Z ) 2025-05-07T20:32:23.3828650Z else: 2025-05-07T20:32:23.3828864Z scale_ub_tensor = None 2025-05-07T20:32:23.3829172Z 2025-05-07T20:32:23.3829677Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.3830000Z op = silu_mul_quant 2025-05-07T20:32:23.3830255Z if compiled: 2025-05-07T20:32:23.3830521Z op = torch.compile(op) 2025-05-07T20:32:23.3830821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.3831097Z 2025-05-07T20:32:23.3831291Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.3831463Z 2025-05-07T20:32:23.3831567Z moe/activation_test.py:117: 2025-05-07T20:32:23.3831864Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.3832195Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.3832559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.3833201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.3833767Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.3834520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.3835216Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.3835753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.3836430Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.3837090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.3837623Z kernel = self.compile( 2025-05-07T20:32:23.3838163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.3838818Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.3839225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.3839495Z 2025-05-07T20:32:23.3839721Z self = 2025-05-07T20:32:23.3840793Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.3842162Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca35d1d00>} 2025-05-07T20:32:23.3843495Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.3844522Z context = 2025-05-07T20:32:23.3844807Z 2025-05-07T20:32:23.3844982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.3845494Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.3845966Z module_map=module_map) 2025-05-07T20:32:23.3846329Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.3846684Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.3846938Z E ^ 2025-05-07T20:32:23.3847404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.3847854Z 2025-05-07T20:32:23.3848279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.3848794Z 2025-05-07T20:32:23.3848914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.3849361Z self=, 2025-05-07T20:32:23.3849765Z T=2048, 2025-05-07T20:32:23.3849960Z D=5120, 2025-05-07T20:32:23.3850201Z scale_ub=None, 2025-05-07T20:32:23.3850425Z contiguous=False, 2025-05-07T20:32:23.3850653Z compiled=True, 2025-05-07T20:32:23.3850853Z ) 2025-05-07T20:32:23.4568763Z self = 2025-05-07T20:32:23.4569431Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:23.4569819Z 2025-05-07T20:32:23.4569922Z @given( 2025-05-07T20:32:23.4570225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.4570627Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.4571027Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.4571628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.4572047Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.4572332Z ) 2025-05-07T20:32:23.4572682Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.4573205Z def test_silu_mul_quant( 2025-05-07T20:32:23.4573453Z self, 2025-05-07T20:32:23.4573657Z T: int, 2025-05-07T20:32:23.4573866Z D: int, 2025-05-07T20:32:23.4574083Z scale_ub: Optional[float], 2025-05-07T20:32:23.4574362Z contiguous: bool, 2025-05-07T20:32:23.4574610Z compiled: bool, 2025-05-07T20:32:23.4574838Z ) -> None: 2025-05-07T20:32:23.4575061Z torch.manual_seed(2025) 2025-05-07T20:32:23.4575310Z 2025-05-07T20:32:23.4575586Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.4575930Z 2025-05-07T20:32:23.4576133Z x_sign = torch.sign(x) 2025-05-07T20:32:23.4576423Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.4576740Z x = x_sign * x_clamp 2025-05-07T20:32:23.4576984Z x0 = x[:, :D] 2025-05-07T20:32:23.4577204Z x1 = x[:, D:] 2025-05-07T20:32:23.4577408Z 2025-05-07T20:32:23.4577604Z if contiguous: 2025-05-07T20:32:23.4577845Z x0 = x0.contiguous() 2025-05-07T20:32:23.4578103Z x1 = x1.contiguous() 2025-05-07T20:32:23.4578345Z 2025-05-07T20:32:23.4578542Z if scale_ub is not None: 2025-05-07T20:32:23.4578809Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.4579148Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.4579487Z ) 2025-05-07T20:32:23.4579703Z else: 2025-05-07T20:32:23.4579919Z scale_ub_tensor = None 2025-05-07T20:32:23.4580172Z 2025-05-07T20:32:23.4580401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.4580717Z op = silu_mul_quant 2025-05-07T20:32:23.4580974Z if compiled: 2025-05-07T20:32:23.4581220Z op = torch.compile(op) 2025-05-07T20:32:23.4581519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.4581796Z 2025-05-07T20:32:23.4581990Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.4582164Z 2025-05-07T20:32:23.4582268Z moe/activation_test.py:117: 2025-05-07T20:32:23.4582562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.4582899Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.4583177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.4583738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.4584302Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.4584957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.4585645Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.4586182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.4586865Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.4587607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.4588144Z kernel = self.compile( 2025-05-07T20:32:23.4588693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.4589443Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.4589834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.4590069Z 2025-05-07T20:32:23.4590278Z self = 2025-05-07T20:32:23.4591358Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.4592867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca35d1620>} 2025-05-07T20:32:23.4594206Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.4595239Z context = 2025-05-07T20:32:23.4595535Z 2025-05-07T20:32:23.4595701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.4596234Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.4596695Z module_map=module_map) 2025-05-07T20:32:23.4597072Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.4597430Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.4597688Z E ^ 2025-05-07T20:32:23.4598160Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.4598613Z 2025-05-07T20:32:23.4599028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.4599537Z 2025-05-07T20:32:23.4599648Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.4600055Z self=, 2025-05-07T20:32:23.4600458Z T=2048, 2025-05-07T20:32:23.4600657Z D=5120, 2025-05-07T20:32:23.4600849Z scale_ub=1200.0, 2025-05-07T20:32:23.4601079Z contiguous=False, 2025-05-07T20:32:23.4601314Z compiled=True, 2025-05-07T20:32:23.4601520Z ) 2025-05-07T20:32:23.4601845Z self = 2025-05-07T20:32:23.4602341Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:23.4602612Z 2025-05-07T20:32:23.4602696Z @given( 2025-05-07T20:32:23.4602937Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.4603257Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.4603565Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.4603890Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.4604218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.4604503Z ) 2025-05-07T20:32:23.4604849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.4605294Z def test_silu_mul_quant( 2025-05-07T20:32:23.4605539Z self, 2025-05-07T20:32:23.4605739Z T: int, 2025-05-07T20:32:23.4605936Z D: int, 2025-05-07T20:32:23.4606163Z scale_ub: Optional[float], 2025-05-07T20:32:23.4606441Z contiguous: bool, 2025-05-07T20:32:23.4606678Z compiled: bool, 2025-05-07T20:32:23.4606908Z ) -> None: 2025-05-07T20:32:23.4607127Z torch.manual_seed(2025) 2025-05-07T20:32:23.4607370Z 2025-05-07T20:32:23.4607704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.4608055Z 2025-05-07T20:32:23.4608255Z x_sign = torch.sign(x) 2025-05-07T20:32:23.4608558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.4608877Z x = x_sign * x_clamp 2025-05-07T20:32:23.4609119Z x0 = x[:, :D] 2025-05-07T20:32:23.4609348Z x1 = x[:, D:] 2025-05-07T20:32:23.4609568Z 2025-05-07T20:32:23.4609760Z if contiguous: 2025-05-07T20:32:23.4610005Z x0 = x0.contiguous() 2025-05-07T20:32:23.4610278Z x1 = x1.contiguous() 2025-05-07T20:32:23.4610530Z 2025-05-07T20:32:23.4610767Z if scale_ub is not None: 2025-05-07T20:32:23.4611086Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.4611427Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.4611733Z ) 2025-05-07T20:32:23.4611933Z else: 2025-05-07T20:32:23.4612199Z scale_ub_tensor = None 2025-05-07T20:32:23.4612448Z 2025-05-07T20:32:23.4612688Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.4613005Z op = silu_mul_quant 2025-05-07T20:32:23.4613252Z if compiled: 2025-05-07T20:32:23.4613506Z op = torch.compile(op) 2025-05-07T20:32:23.4613807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.4614077Z 2025-05-07T20:32:23.4614275Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.4614437Z 2025-05-07T20:32:23.4614548Z moe/activation_test.py:117: 2025-05-07T20:32:23.4614846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.4615180Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.4615474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.4616032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.4616586Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.4617251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.4617943Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.4618479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.4619153Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.4619817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.4620401Z kernel = self.compile( 2025-05-07T20:32:23.4620939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.4621599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.4622004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.4622233Z 2025-05-07T20:32:23.4622447Z self = 2025-05-07T20:32:23.4623518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.4624878Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca34905e0>} 2025-05-07T20:32:23.4626219Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.4627272Z context = 2025-05-07T20:32:23.4627566Z 2025-05-07T20:32:23.4627790Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.4628625Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.4629142Z module_map=module_map) 2025-05-07T20:32:23.4629518Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.4629873Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.4630144Z E ^ 2025-05-07T20:32:23.4630616Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.4637735Z 2025-05-07T20:32:23.4638195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.4638982Z 2025-05-07T20:32:23.5956420Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.5958054Z self=, 2025-05-07T20:32:23.5958879Z T=4096, 2025-05-07T20:32:23.5959082Z D=5120, 2025-05-07T20:32:23.5959280Z scale_ub=1200.0, 2025-05-07T20:32:23.5959517Z contiguous=True, 2025-05-07T20:32:23.5959746Z compiled=True, 2025-05-07T20:32:23.5959960Z ) 2025-05-07T20:32:23.5960291Z self = 2025-05-07T20:32:23.5960787Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:23.5961057Z 2025-05-07T20:32:23.5961149Z @given( 2025-05-07T20:32:23.5961383Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.5961707Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.5962015Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.5962356Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.5962692Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.5962982Z ) 2025-05-07T20:32:23.5963343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.5963788Z def test_silu_mul_quant( 2025-05-07T20:32:23.5964040Z self, 2025-05-07T20:32:23.5964245Z T: int, 2025-05-07T20:32:23.5964454Z D: int, 2025-05-07T20:32:23.5964691Z scale_ub: Optional[float], 2025-05-07T20:32:23.5964967Z contiguous: bool, 2025-05-07T20:32:23.5965224Z compiled: bool, 2025-05-07T20:32:23.5965468Z ) -> None: 2025-05-07T20:32:23.5965699Z torch.manual_seed(2025) 2025-05-07T20:32:23.5965945Z 2025-05-07T20:32:23.5966234Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.5966586Z 2025-05-07T20:32:23.5966792Z x_sign = torch.sign(x) 2025-05-07T20:32:23.5967106Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.5967426Z x = x_sign * x_clamp 2025-05-07T20:32:23.5967669Z x0 = x[:, :D] 2025-05-07T20:32:23.5967897Z x1 = x[:, D:] 2025-05-07T20:32:23.5968120Z 2025-05-07T20:32:23.5968312Z if contiguous: 2025-05-07T20:32:23.5968555Z x0 = x0.contiguous() 2025-05-07T20:32:23.5968824Z x1 = x1.contiguous() 2025-05-07T20:32:23.5969064Z 2025-05-07T20:32:23.5969269Z if scale_ub is not None: 2025-05-07T20:32:23.5969546Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.5969882Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.5970198Z ) 2025-05-07T20:32:23.5970402Z else: 2025-05-07T20:32:23.5970620Z scale_ub_tensor = None 2025-05-07T20:32:23.5970871Z 2025-05-07T20:32:23.5971110Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.5971439Z op = silu_mul_quant 2025-05-07T20:32:23.5971693Z if compiled: 2025-05-07T20:32:23.5971949Z op = torch.compile(op) 2025-05-07T20:32:23.5972246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.5972517Z 2025-05-07T20:32:23.5972832Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.5972998Z 2025-05-07T20:32:23.5973109Z moe/activation_test.py:117: 2025-05-07T20:32:23.5973400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.5973736Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.5974021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.5974584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.5975141Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.5975809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.5976667Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.5977199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.5977933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.5978607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.5979143Z kernel = self.compile( 2025-05-07T20:32:23.5979700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.5980403Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.5980811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.5981041Z 2025-05-07T20:32:23.5981254Z self = 2025-05-07T20:32:23.5982337Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.5983741Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3491120>} 2025-05-07T20:32:23.5985089Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.5986125Z context = 2025-05-07T20:32:23.5986420Z 2025-05-07T20:32:23.5986587Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.5987107Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.5987587Z module_map=module_map) 2025-05-07T20:32:23.5987965Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.5988320Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.5988593Z E ^ 2025-05-07T20:32:23.5989182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.5989663Z 2025-05-07T20:32:23.5990110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.5990629Z 2025-05-07T20:32:23.5990738Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.5991162Z self=, 2025-05-07T20:32:23.5991576Z T=128, 2025-05-07T20:32:23.5991766Z D=5120, 2025-05-07T20:32:23.5991970Z scale_ub=1200.0, 2025-05-07T20:32:23.5992207Z contiguous=False, 2025-05-07T20:32:23.5992434Z compiled=True, 2025-05-07T20:32:23.5992655Z ) 2025-05-07T20:32:23.8503872Z self = 2025-05-07T20:32:23.8504653Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:23.8505042Z 2025-05-07T20:32:23.8505437Z @given( 2025-05-07T20:32:23.8505700Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.8506019Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.8506323Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.8506661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.8506996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.8507286Z ) 2025-05-07T20:32:23.8507634Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.8508083Z def test_silu_mul_quant( 2025-05-07T20:32:23.8508338Z self, 2025-05-07T20:32:23.8508628Z T: int, 2025-05-07T20:32:23.8508920Z D: int, 2025-05-07T20:32:23.8509248Z scale_ub: Optional[float], 2025-05-07T20:32:23.8509546Z contiguous: bool, 2025-05-07T20:32:23.8509814Z compiled: bool, 2025-05-07T20:32:23.8510050Z ) -> None: 2025-05-07T20:32:23.8510349Z torch.manual_seed(2025) 2025-05-07T20:32:23.8510594Z 2025-05-07T20:32:23.8510874Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.8511211Z 2025-05-07T20:32:23.8511415Z x_sign = torch.sign(x) 2025-05-07T20:32:23.8511711Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.8512032Z x = x_sign * x_clamp 2025-05-07T20:32:23.8512272Z x0 = x[:, :D] 2025-05-07T20:32:23.8512492Z x1 = x[:, D:] 2025-05-07T20:32:23.8512705Z 2025-05-07T20:32:23.8512891Z if contiguous: 2025-05-07T20:32:23.8513127Z x0 = x0.contiguous() 2025-05-07T20:32:23.8513392Z x1 = x1.contiguous() 2025-05-07T20:32:23.8513630Z 2025-05-07T20:32:23.8513830Z if scale_ub is not None: 2025-05-07T20:32:23.8514109Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.8514437Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.8514758Z ) 2025-05-07T20:32:23.8514957Z else: 2025-05-07T20:32:23.8515164Z scale_ub_tensor = None 2025-05-07T20:32:23.8515416Z 2025-05-07T20:32:23.8515656Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.8515963Z op = silu_mul_quant 2025-05-07T20:32:23.8516223Z if compiled: 2025-05-07T20:32:23.8516473Z op = torch.compile(op) 2025-05-07T20:32:23.8516767Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.8517045Z 2025-05-07T20:32:23.8517242Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.8517404Z 2025-05-07T20:32:23.8517512Z moe/activation_test.py:117: 2025-05-07T20:32:23.8517803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.8518149Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.8518432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.8518990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.8519606Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.8520264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.8520956Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.8521486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.8522176Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.8522839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.8523370Z kernel = self.compile( 2025-05-07T20:32:23.8523916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.8524585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.8525048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.8525278Z 2025-05-07T20:32:23.8525489Z self = 2025-05-07T20:32:23.8526566Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.8527955Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3492340>} 2025-05-07T20:32:23.8529664Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.8530811Z context = 2025-05-07T20:32:23.8531104Z 2025-05-07T20:32:23.8531272Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.8531797Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.8532266Z module_map=module_map) 2025-05-07T20:32:23.8532631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.8532985Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.8533246Z E ^ 2025-05-07T20:32:23.8533715Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.8534162Z 2025-05-07T20:32:23.8534577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.8535093Z 2025-05-07T20:32:23.8535198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.8535615Z self=, 2025-05-07T20:32:23.8536010Z T=16384, 2025-05-07T20:32:23.8536210Z D=7168, 2025-05-07T20:32:23.8536411Z scale_ub=1200.0, 2025-05-07T20:32:23.8536636Z contiguous=True, 2025-05-07T20:32:23.8536853Z compiled=True, 2025-05-07T20:32:23.8537084Z ) 2025-05-07T20:32:23.8537407Z self = 2025-05-07T20:32:23.8537904Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:23.8538178Z 2025-05-07T20:32:23.8538259Z @given( 2025-05-07T20:32:23.8538503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.8538826Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.8539130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.8539472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.8539847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.8540133Z ) 2025-05-07T20:32:23.8540491Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.8540934Z def test_silu_mul_quant( 2025-05-07T20:32:23.8541185Z self, 2025-05-07T20:32:23.8541376Z T: int, 2025-05-07T20:32:23.8541583Z D: int, 2025-05-07T20:32:23.8541810Z scale_ub: Optional[float], 2025-05-07T20:32:23.8542077Z contiguous: bool, 2025-05-07T20:32:23.8542321Z compiled: bool, 2025-05-07T20:32:23.8542551Z ) -> None: 2025-05-07T20:32:23.8542765Z torch.manual_seed(2025) 2025-05-07T20:32:23.8543006Z 2025-05-07T20:32:23.8543280Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.8543621Z 2025-05-07T20:32:23.8543824Z x_sign = torch.sign(x) 2025-05-07T20:32:23.8544123Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.8544427Z x = x_sign * x_clamp 2025-05-07T20:32:23.8544676Z x0 = x[:, :D] 2025-05-07T20:32:23.8544980Z x1 = x[:, D:] 2025-05-07T20:32:23.8545190Z 2025-05-07T20:32:23.8545380Z if contiguous: 2025-05-07T20:32:23.8545622Z x0 = x0.contiguous() 2025-05-07T20:32:23.8545877Z x1 = x1.contiguous() 2025-05-07T20:32:23.8546126Z 2025-05-07T20:32:23.8546324Z if scale_ub is not None: 2025-05-07T20:32:23.8546596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.8546931Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.8547242Z ) 2025-05-07T20:32:23.8547436Z else: 2025-05-07T20:32:23.8547645Z scale_ub_tensor = None 2025-05-07T20:32:23.8547904Z 2025-05-07T20:32:23.8548235Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.8548591Z op = silu_mul_quant 2025-05-07T20:32:23.8548850Z if compiled: 2025-05-07T20:32:23.8549152Z op = torch.compile(op) 2025-05-07T20:32:23.8549484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.8549800Z 2025-05-07T20:32:23.8550026Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.8550190Z 2025-05-07T20:32:23.8550293Z moe/activation_test.py:117: 2025-05-07T20:32:23.8550603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.8550949Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.8551245Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.8551807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.8552373Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.8553038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.8553727Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.8554277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.8554967Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.8555642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.8556169Z kernel = self.compile( 2025-05-07T20:32:23.8556713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.8557370Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.8557770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.8558010Z 2025-05-07T20:32:23.8558220Z self = 2025-05-07T20:32:23.8559366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.8560741Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3493c40>} 2025-05-07T20:32:23.8562090Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.8563115Z context = 2025-05-07T20:32:23.8563413Z 2025-05-07T20:32:23.8563582Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.8564102Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.8564577Z module_map=module_map) 2025-05-07T20:32:23.8564946Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.8565355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.8565623Z E ^ 2025-05-07T20:32:23.8566081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.8566535Z 2025-05-07T20:32:23.8566948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.8567462Z 2025-05-07T20:32:23.9526868Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.9527483Z self=, 2025-05-07T20:32:23.9528103Z T=16384, 2025-05-07T20:32:23.9528654Z D=5120, 2025-05-07T20:32:23.9529175Z scale_ub=1200.0, 2025-05-07T20:32:23.9529524Z contiguous=True, 2025-05-07T20:32:23.9529749Z compiled=False, 2025-05-07T20:32:23.9529956Z ) 2025-05-07T20:32:23.9530277Z self = 2025-05-07T20:32:23.9530863Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:23.9531144Z 2025-05-07T20:32:23.9531231Z @given( 2025-05-07T20:32:23.9531462Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.9531778Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.9532086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.9532415Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.9532749Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.9533038Z ) 2025-05-07T20:32:23.9533381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.9533829Z def test_silu_mul_quant( 2025-05-07T20:32:23.9534087Z self, 2025-05-07T20:32:23.9534288Z T: int, 2025-05-07T20:32:23.9534497Z D: int, 2025-05-07T20:32:23.9534722Z scale_ub: Optional[float], 2025-05-07T20:32:23.9534999Z contiguous: bool, 2025-05-07T20:32:23.9535245Z compiled: bool, 2025-05-07T20:32:23.9535485Z ) -> None: 2025-05-07T20:32:23.9535710Z torch.manual_seed(2025) 2025-05-07T20:32:23.9535952Z 2025-05-07T20:32:23.9536228Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.9536577Z 2025-05-07T20:32:23.9536774Z x_sign = torch.sign(x) 2025-05-07T20:32:23.9537068Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.9537386Z x = x_sign * x_clamp 2025-05-07T20:32:23.9537629Z x0 = x[:, :D] 2025-05-07T20:32:23.9537848Z x1 = x[:, D:] 2025-05-07T20:32:23.9538060Z 2025-05-07T20:32:23.9538248Z if contiguous: 2025-05-07T20:32:23.9538487Z x0 = x0.contiguous() 2025-05-07T20:32:23.9538754Z x1 = x1.contiguous() 2025-05-07T20:32:23.9538991Z 2025-05-07T20:32:23.9539189Z if scale_ub is not None: 2025-05-07T20:32:23.9539464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.9539808Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.9540117Z ) 2025-05-07T20:32:23.9540319Z else: 2025-05-07T20:32:23.9540540Z scale_ub_tensor = None 2025-05-07T20:32:23.9540789Z 2025-05-07T20:32:23.9541027Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.9541347Z op = silu_mul_quant 2025-05-07T20:32:23.9541596Z if compiled: 2025-05-07T20:32:23.9541847Z op = torch.compile(op) 2025-05-07T20:32:23.9542148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9542424Z 2025-05-07T20:32:23.9542629Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.9542799Z 2025-05-07T20:32:23.9542916Z moe/activation_test.py:117: 2025-05-07T20:32:23.9543209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9543545Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.9543828Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9544601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:23.9545285Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.9545822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.9546507Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.9547159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.9547695Z kernel = self.compile( 2025-05-07T20:32:23.9548241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.9549017Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.9549503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9549781Z 2025-05-07T20:32:23.9549997Z self = 2025-05-07T20:32:23.9551080Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.9552474Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca2d00c20>} 2025-05-07T20:32:23.9553826Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.9554869Z context = 2025-05-07T20:32:23.9555164Z 2025-05-07T20:32:23.9555337Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.9555863Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.9556326Z module_map=module_map) 2025-05-07T20:32:23.9556699Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.9557058Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.9557322Z E ^ 2025-05-07T20:32:23.9557783Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.9558236Z 2025-05-07T20:32:23.9558649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.9559187Z 2025-05-07T20:32:23.9559312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.9559737Z self=, 2025-05-07T20:32:23.9560135Z T=1, 2025-05-07T20:32:23.9560327Z D=7168, 2025-05-07T20:32:23.9560529Z scale_ub=1200.0, 2025-05-07T20:32:23.9560753Z contiguous=False, 2025-05-07T20:32:23.9560983Z compiled=False, 2025-05-07T20:32:23.9561195Z ) 2025-05-07T20:32:23.9561510Z self = 2025-05-07T20:32:23.9561996Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:23.9562259Z 2025-05-07T20:32:23.9562345Z @given( 2025-05-07T20:32:23.9562579Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.9562893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.9563201Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.9563531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.9563866Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.9564152Z ) 2025-05-07T20:32:23.9564507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.9564993Z def test_silu_mul_quant( 2025-05-07T20:32:23.9565245Z self, 2025-05-07T20:32:23.9565446Z T: int, 2025-05-07T20:32:23.9565646Z D: int, 2025-05-07T20:32:23.9565867Z scale_ub: Optional[float], 2025-05-07T20:32:23.9566140Z contiguous: bool, 2025-05-07T20:32:23.9566382Z compiled: bool, 2025-05-07T20:32:23.9566607Z ) -> None: 2025-05-07T20:32:23.9566829Z torch.manual_seed(2025) 2025-05-07T20:32:23.9567065Z 2025-05-07T20:32:23.9567350Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.9567692Z 2025-05-07T20:32:23.9567884Z x_sign = torch.sign(x) 2025-05-07T20:32:23.9568228Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.9568605Z x = x_sign * x_clamp 2025-05-07T20:32:23.9568853Z x0 = x[:, :D] 2025-05-07T20:32:23.9569070Z x1 = x[:, D:] 2025-05-07T20:32:23.9569283Z 2025-05-07T20:32:23.9569525Z if contiguous: 2025-05-07T20:32:23.9569792Z x0 = x0.contiguous() 2025-05-07T20:32:23.9577212Z x1 = x1.contiguous() 2025-05-07T20:32:23.9577481Z 2025-05-07T20:32:23.9577678Z if scale_ub is not None: 2025-05-07T20:32:23.9577965Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.9578322Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.9578638Z ) 2025-05-07T20:32:23.9578834Z else: 2025-05-07T20:32:23.9579052Z scale_ub_tensor = None 2025-05-07T20:32:23.9579319Z 2025-05-07T20:32:23.9579585Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.9579934Z op = silu_mul_quant 2025-05-07T20:32:23.9580205Z if compiled: 2025-05-07T20:32:23.9580453Z op = torch.compile(op) 2025-05-07T20:32:23.9580760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9581044Z 2025-05-07T20:32:23.9581238Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.9581412Z 2025-05-07T20:32:23.9581515Z moe/activation_test.py:117: 2025-05-07T20:32:23.9581821Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9582173Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.9582451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9583159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.9583860Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.9584396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.9585088Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.9585762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.9586299Z kernel = self.compile( 2025-05-07T20:32:23.9586845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.9587504Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.9587911Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9588141Z 2025-05-07T20:32:23.9588351Z self = 2025-05-07T20:32:23.9589538Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.9590987Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca2d01120>} 2025-05-07T20:32:23.9592459Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.9593497Z context = 2025-05-07T20:32:23.9593791Z 2025-05-07T20:32:23.9593959Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.9594485Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.9594962Z module_map=module_map) 2025-05-07T20:32:23.9595341Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.9595692Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.9596015Z E ^ 2025-05-07T20:32:23.9596537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.9596989Z 2025-05-07T20:32:23.9597455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.9597981Z 2025-05-07T20:32:24.0932823Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.0933418Z self=, 2025-05-07T20:32:24.0934022Z T=4096, 2025-05-07T20:32:24.0934293Z D=7168, 2025-05-07T20:32:24.0934565Z scale_ub=1200.0, 2025-05-07T20:32:24.0934880Z contiguous=False, 2025-05-07T20:32:24.0935118Z compiled=True, 2025-05-07T20:32:24.0935337Z ) 2025-05-07T20:32:24.0935704Z self = 2025-05-07T20:32:24.0936273Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:24.0936569Z 2025-05-07T20:32:24.0936651Z @given( 2025-05-07T20:32:24.0936897Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.0937216Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.0937539Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.0937872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.0938206Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.0938504Z ) 2025-05-07T20:32:24.0938856Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.0939307Z def test_silu_mul_quant( 2025-05-07T20:32:24.0939560Z self, 2025-05-07T20:32:24.0939785Z T: int, 2025-05-07T20:32:24.0940016Z D: int, 2025-05-07T20:32:24.0940245Z scale_ub: Optional[float], 2025-05-07T20:32:24.0940516Z contiguous: bool, 2025-05-07T20:32:24.0940766Z compiled: bool, 2025-05-07T20:32:24.0941004Z ) -> None: 2025-05-07T20:32:24.0941226Z torch.manual_seed(2025) 2025-05-07T20:32:24.0941477Z 2025-05-07T20:32:24.0941763Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.0942114Z 2025-05-07T20:32:24.0942314Z x_sign = torch.sign(x) 2025-05-07T20:32:24.0942616Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.0942933Z x = x_sign * x_clamp 2025-05-07T20:32:24.0943175Z x0 = x[:, :D] 2025-05-07T20:32:24.0943404Z x1 = x[:, D:] 2025-05-07T20:32:24.0943622Z 2025-05-07T20:32:24.0943808Z if contiguous: 2025-05-07T20:32:24.0944047Z x0 = x0.contiguous() 2025-05-07T20:32:24.0944312Z x1 = x1.contiguous() 2025-05-07T20:32:24.0944549Z 2025-05-07T20:32:24.0944750Z if scale_ub is not None: 2025-05-07T20:32:24.0945028Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.0945362Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.0945681Z ) 2025-05-07T20:32:24.0945892Z else: 2025-05-07T20:32:24.0946107Z scale_ub_tensor = None 2025-05-07T20:32:24.0946365Z 2025-05-07T20:32:24.0946608Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.0947216Z op = silu_mul_quant 2025-05-07T20:32:24.0947472Z if compiled: 2025-05-07T20:32:24.0947727Z op = torch.compile(op) 2025-05-07T20:32:24.0948030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.0948301Z 2025-05-07T20:32:24.0948500Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.0948665Z 2025-05-07T20:32:24.0948776Z moe/activation_test.py:117: 2025-05-07T20:32:24.0949169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.0949537Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.0949851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.0950409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:24.0951140Z return fn(*args, **kwargs) 
2025-05-07T20:32:24.0951876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.0952571Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.0953106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.0953792Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.0954460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.0955000Z kernel = self.compile( 2025-05-07T20:32:24.0955538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.0956195Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.0956603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.0956832Z 2025-05-07T20:32:24.0957044Z self = 2025-05-07T20:32:24.0958130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.0959512Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca2d02f20>} 2025-05-07T20:32:24.0960858Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.0961874Z context = 2025-05-07T20:32:24.0962170Z 2025-05-07T20:32:24.0962336Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.0962859Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.0963335Z module_map=module_map) 2025-05-07T20:32:24.0963696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.0964065Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.0964337Z E ^ 2025-05-07T20:32:24.0964796Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.0965252Z 2025-05-07T20:32:24.0965667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

[Hypothesis retries condensed below: each example re-prints the identical test source and traceback shown in full above; only the sampled parameters and the failing line differ.]

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
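Every example that reaches the Triton kernel fails with this same ValueError: fp8e4nv is Triton's name for the FP8 E4M3 format, and Triton's NVIDIA backend only lowers it on GPUs of compute capability 8.9 or newer (Ada/Hopper). This job's linux.g5.4xlarge runner carries an NVIDIA A10G, which is sm_86, so the check fails before any code is generated. A minimal sketch of a guard one could put in front of such tests follows; the helper and class names are hypothetical, and FBGEMM may already gate this differently elsewhere.

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton maps fp8e4nv to FP8 E4M3, which its
        # CUDA backend accepts only on compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(
        not gpu_supports_fp8e4nv(),
        "Triton fp8e4nv requires sm_89+; older GPUs raise CompilationError",
    )
    class SiluMulQuantFP8Test(unittest.TestCase):
        ...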
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. (moe/activation_test.py:95: OutOfMemoryError)
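From this point on, most failures are not the fp8 problem at all: the device is nearly exhausted (roughly 22 GiB of the 22.07 GiB capacity already in use), so even the 40-448 MiB tensors the test allocates cannot be placed. The error text itself suggests the allocator setting sketched below. This is a minimal sketch, assuming the variable is set before torch first initializes CUDA; the cleanup helper between Hypothesis examples is an assumption, not something the test currently does.

    import gc
    import os

    # Must be set before the first CUDA allocation for the allocator to honor it.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references, then return cached blocks to the
        # driver so the next Hypothesis example starts from a clean pool.
        gc.collect()
        torch.cuda.empty_cache()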
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (moe/activation_test.py:92: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. (moe/activation_test.py:95: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. (moe/activation_test.py:94: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. (moe/activation_test.py:92: OutOfMemoryError)
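The requested sizes are exactly what a [T, 2*D] bfloat16 tensor occupies, and each failing statement (torch.randn at line 92, torch.sign at 94, torch.abs/torch.clamp at 95) materializes one more full-size copy of x, which is why successive examples die at different lines depending on how full the device already is. A quick arithmetic check:

    def x_mib(T: int, D: int) -> float:
        # x has shape [T, 2 * D] in bfloat16 (2 bytes per element).
        return T * 2 * D * 2 / 2**20

    # x_mib(16384, 7168) == 448.0  -> "Tried to allocate 448.00 MiB"
    # x_mib(16384, 5120) == 320.0  -> 320.00 MiB
    # x_mib(4096, 7168)  == 112.0  -> 112.00 MiB
    # x_mib(4096, 5120)  ==  80.0  ->  80.00 MiB
    # x_mib(2048, 7168)  ==  56.0  ->  56.00 MiB
    # x_mib(2048, 5120)  ==  40.0  ->  40.00 MiB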
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. (moe/activation_test.py:94: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. (moe/activation_test.py:92: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. (moe/activation_test.py:92: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. (moe/activation_test.py:92: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.7135280Z 2025-05-07T20:32:24.7135400Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.7135617Z 2025-05-07T20:32:24.7135723Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.7136145Z self=, 2025-05-07T20:32:24.7136549Z T=2048, 2025-05-07T20:32:24.7136743Z D=5120, 2025-05-07T20:32:24.7136942Z scale_ub=1200.0, 2025-05-07T20:32:24.7137175Z contiguous=False, 2025-05-07T20:32:24.7137417Z compiled=False, 2025-05-07T20:32:24.7137634Z ) 2025-05-07T20:32:24.7137970Z self = 2025-05-07T20:32:24.7138469Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:24.7138756Z 2025-05-07T20:32:24.7138839Z @given( 2025-05-07T20:32:24.7139088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.7139456Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.7139777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.7140118Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.7140455Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.7140751Z ) 2025-05-07T20:32:24.7141108Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.7141560Z def test_silu_mul_quant( 2025-05-07T20:32:24.7141808Z self, 2025-05-07T20:32:24.7142103Z T: int, 2025-05-07T20:32:24.7142313Z D: int, 2025-05-07T20:32:24.7142536Z scale_ub: Optional[float], 2025-05-07T20:32:24.7142817Z contiguous: bool, 2025-05-07T20:32:24.7143076Z compiled: bool, 2025-05-07T20:32:24.7143301Z ) -> None: 2025-05-07T20:32:24.7143526Z torch.manual_seed(2025) 2025-05-07T20:32:24.7143773Z 2025-05-07T20:32:24.7144045Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.7146143Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.7148089Z 2025-05-07T20:32:24.7148210Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.7148430Z 2025-05-07T20:32:24.7148537Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.7148953Z self=, 2025-05-07T20:32:24.7149402Z T=4096, 2025-05-07T20:32:24.7149597Z D=7168, 2025-05-07T20:32:24.7149798Z scale_ub=1200.0, 2025-05-07T20:32:24.7150037Z contiguous=True, 2025-05-07T20:32:24.7150314Z compiled=False, 2025-05-07T20:32:24.7150524Z ) 2025-05-07T20:32:24.8076262Z self = 2025-05-07T20:32:24.8077000Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.8077332Z 2025-05-07T20:32:24.8077442Z @given( 2025-05-07T20:32:24.8077692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8078021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8078337Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8078669Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8079005Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8079329Z ) 2025-05-07T20:32:24.8079761Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8080322Z def test_silu_mul_quant( 2025-05-07T20:32:24.8080636Z self, 2025-05-07T20:32:24.8080881Z T: int, 2025-05-07T20:32:24.8081135Z D: int, 2025-05-07T20:32:24.8081417Z scale_ub: Optional[float], 2025-05-07T20:32:24.8081757Z contiguous: bool, 2025-05-07T20:32:24.8082006Z compiled: bool, 2025-05-07T20:32:24.8082240Z ) -> None: 2025-05-07T20:32:24.8082458Z torch.manual_seed(2025) 2025-05-07T20:32:24.8082710Z 2025-05-07T20:32:24.8082998Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8085048Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8086893Z 2025-05-07T20:32:24.8087023Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8087237Z 2025-05-07T20:32:24.8087346Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8087762Z self=, 2025-05-07T20:32:24.8088167Z T=16384, 2025-05-07T20:32:24.8088362Z D=7168, 2025-05-07T20:32:24.8088746Z scale_ub=None, 2025-05-07T20:32:24.8088990Z contiguous=False, 2025-05-07T20:32:24.8089268Z compiled=True, 2025-05-07T20:32:24.8089532Z ) 2025-05-07T20:32:24.8089935Z self = 2025-05-07T20:32:24.8090555Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:24.8090897Z 2025-05-07T20:32:24.8090997Z @given( 2025-05-07T20:32:24.8091242Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8091561Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8091860Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8092275Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8092683Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8092962Z ) 2025-05-07T20:32:24.8093314Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8093833Z def test_silu_mul_quant( 2025-05-07T20:32:24.8094082Z self, 2025-05-07T20:32:24.8094278Z T: int, 2025-05-07T20:32:24.8094482Z D: int, 2025-05-07T20:32:24.8094703Z scale_ub: Optional[float], 2025-05-07T20:32:24.8094977Z contiguous: bool, 2025-05-07T20:32:24.8095222Z compiled: bool, 2025-05-07T20:32:24.8095446Z ) -> None: 2025-05-07T20:32:24.8095661Z torch.manual_seed(2025) 2025-05-07T20:32:24.8095911Z 2025-05-07T20:32:24.8096192Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8098229Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8100071Z 2025-05-07T20:32:24.8100194Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8100418Z 2025-05-07T20:32:24.8100523Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8100935Z self=, 2025-05-07T20:32:24.8101339Z T=4096, 2025-05-07T20:32:24.8101534Z D=7168, 2025-05-07T20:32:24.8101740Z scale_ub=None, 2025-05-07T20:32:24.8101960Z contiguous=True, 2025-05-07T20:32:24.8102182Z compiled=False, 2025-05-07T20:32:24.8102405Z ) 2025-05-07T20:32:24.8102732Z self = 2025-05-07T20:32:24.8103221Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.8103500Z 2025-05-07T20:32:24.8103589Z @given( 2025-05-07T20:32:24.8103829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8104138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8104451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8104788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8105125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8105405Z ) 2025-05-07T20:32:24.8105761Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8106203Z def test_silu_mul_quant( 2025-05-07T20:32:24.8106445Z self, 2025-05-07T20:32:24.8106650Z T: int, 2025-05-07T20:32:24.8106856Z D: int, 2025-05-07T20:32:24.8107075Z scale_ub: Optional[float], 2025-05-07T20:32:24.8107354Z contiguous: bool, 2025-05-07T20:32:24.8107608Z compiled: bool, 2025-05-07T20:32:24.8107830Z ) -> None: 2025-05-07T20:32:24.8108051Z torch.manual_seed(2025) 2025-05-07T20:32:24.8108299Z 2025-05-07T20:32:24.8108622Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8111168Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8113189Z 2025-05-07T20:32:24.8113350Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8113567Z 2025-05-07T20:32:24.8113672Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8114124Z self=, 2025-05-07T20:32:24.8114527Z T=16384, 2025-05-07T20:32:24.8114731Z D=7168, 2025-05-07T20:32:24.8114930Z scale_ub=None, 2025-05-07T20:32:24.8115147Z contiguous=True, 2025-05-07T20:32:24.8115378Z compiled=False, 2025-05-07T20:32:24.8115587Z ) 2025-05-07T20:32:24.8115904Z self = 2025-05-07T20:32:24.8116397Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.8116678Z 2025-05-07T20:32:24.8116760Z @given( 2025-05-07T20:32:24.8124084Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8124454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8124804Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8125189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8125569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8125897Z ) 2025-05-07T20:32:24.8126301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8126830Z def test_silu_mul_quant( 2025-05-07T20:32:24.8127100Z self, 2025-05-07T20:32:24.8127306Z T: int, 2025-05-07T20:32:24.8127519Z D: int, 2025-05-07T20:32:24.8127757Z scale_ub: Optional[float], 2025-05-07T20:32:24.8128056Z contiguous: bool, 2025-05-07T20:32:24.8128585Z compiled: bool, 2025-05-07T20:32:24.8128818Z ) -> None: 2025-05-07T20:32:24.8129040Z torch.manual_seed(2025) 2025-05-07T20:32:24.8129295Z 2025-05-07T20:32:24.8129587Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8131682Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8133574Z 2025-05-07T20:32:24.8133711Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8133929Z 2025-05-07T20:32:24.8134035Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8134461Z self=, 2025-05-07T20:32:24.8134878Z T=16384, 2025-05-07T20:32:24.8135073Z D=7168, 2025-05-07T20:32:24.8135284Z scale_ub=1200.0, 2025-05-07T20:32:24.8135516Z contiguous=True, 2025-05-07T20:32:24.8135743Z compiled=False, 2025-05-07T20:32:24.8135958Z ) 2025-05-07T20:32:24.8136289Z self = 2025-05-07T20:32:24.8136796Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.8137083Z 2025-05-07T20:32:24.8137281Z @given( 2025-05-07T20:32:24.8137525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8137846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8138155Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8138495Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8138837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8139186Z ) 2025-05-07T20:32:24.8139630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8140193Z def test_silu_mul_quant( 2025-05-07T20:32:24.8140493Z self, 2025-05-07T20:32:24.8140830Z T: int, 2025-05-07T20:32:24.8141188Z D: int, 2025-05-07T20:32:24.8141471Z scale_ub: Optional[float], 2025-05-07T20:32:24.8141790Z contiguous: bool, 2025-05-07T20:32:24.8142036Z compiled: bool, 2025-05-07T20:32:24.8142319Z ) -> None: 2025-05-07T20:32:24.8142541Z torch.manual_seed(2025) 2025-05-07T20:32:24.8142795Z 2025-05-07T20:32:24.8143077Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8145164Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8147060Z 2025-05-07T20:32:24.8147185Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8147408Z 2025-05-07T20:32:24.8147516Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8147949Z self=, 2025-05-07T20:32:24.8148365Z T=128, 2025-05-07T20:32:24.8148561Z D=5120, 2025-05-07T20:32:24.8148772Z scale_ub=1200.0, 2025-05-07T20:32:24.8149020Z contiguous=False, 2025-05-07T20:32:24.8149311Z compiled=False, 2025-05-07T20:32:24.8149526Z ) 2025-05-07T20:32:24.9163760Z self = 2025-05-07T20:32:24.9164319Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:24.9164702Z 2025-05-07T20:32:24.9164790Z @given( 2025-05-07T20:32:24.9165032Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9165366Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9165682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9166025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9166371Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9166660Z ) 2025-05-07T20:32:24.9167025Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9167472Z def test_silu_mul_quant( 2025-05-07T20:32:24.9167718Z self, 2025-05-07T20:32:24.9167927Z T: int, 2025-05-07T20:32:24.9168137Z D: int, 2025-05-07T20:32:24.9168361Z scale_ub: Optional[float], 2025-05-07T20:32:24.9168631Z contiguous: bool, 2025-05-07T20:32:24.9168881Z compiled: bool, 2025-05-07T20:32:24.9169111Z ) -> None: 2025-05-07T20:32:24.9169326Z torch.manual_seed(2025) 2025-05-07T20:32:24.9169582Z 2025-05-07T20:32:24.9169861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9170209Z 2025-05-07T20:32:24.9170414Z x_sign = torch.sign(x) 2025-05-07T20:32:24.9170711Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.9171025Z x = x_sign * x_clamp 2025-05-07T20:32:24.9171273Z x0 = x[:, :D] 2025-05-07T20:32:24.9171771Z x1 = x[:, D:] 2025-05-07T20:32:24.9171980Z 2025-05-07T20:32:24.9172175Z if contiguous: 2025-05-07T20:32:24.9172416Z x0 = x0.contiguous() 2025-05-07T20:32:24.9172674Z x1 = x1.contiguous() 2025-05-07T20:32:24.9172924Z 2025-05-07T20:32:24.9173126Z if scale_ub is not None: 2025-05-07T20:32:24.9173401Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.9173749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.9174067Z ) 2025-05-07T20:32:24.9174266Z else: 2025-05-07T20:32:24.9174480Z scale_ub_tensor = None 2025-05-07T20:32:24.9174740Z 2025-05-07T20:32:24.9175075Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.9175488Z op = silu_mul_quant 2025-05-07T20:32:24.9175744Z if compiled: 2025-05-07T20:32:24.9175999Z op = torch.compile(op) 2025-05-07T20:32:24.9176383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.9176667Z 2025-05-07T20:32:24.9176868Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.9177032Z 2025-05-07T20:32:24.9177136Z moe/activation_test.py:117: 2025-05-07T20:32:24.9177458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.9177800Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.9178090Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.9178787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.9179473Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.9180023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.9180723Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.9181384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.9181925Z kernel = self.compile( 2025-05-07T20:32:24.9182476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.9183141Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.9183537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.9183772Z 2025-05-07T20:32:24.9183983Z self = 2025-05-07T20:32:24.9185067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.9186473Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca290a5c0>} 2025-05-07T20:32:24.9187821Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.9188833Z context = 2025-05-07T20:32:24.9189233Z 2025-05-07T20:32:24.9189402Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.9189924Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.9190393Z module_map=module_map) 2025-05-07T20:32:24.9190755Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.9191117Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.9191383Z E ^ 2025-05-07T20:32:24.9191902Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.9192458Z 2025-05-07T20:32:24.9192961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.9193593Z 2025-05-07T20:32:24.9193706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9194177Z self=, 2025-05-07T20:32:24.9194636Z T=2048, 2025-05-07T20:32:24.9194844Z D=7168, 2025-05-07T20:32:24.9195050Z scale_ub=None, 2025-05-07T20:32:24.9195279Z contiguous=False, 2025-05-07T20:32:24.9195523Z compiled=False, 2025-05-07T20:32:24.9195743Z ) 2025-05-07T20:32:24.9196141Z self = 2025-05-07T20:32:24.9196755Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:24.9197080Z 2025-05-07T20:32:24.9197161Z @given( 2025-05-07T20:32:24.9197447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9197796Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9198140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9198515Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9198882Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9199206Z ) 2025-05-07T20:32:24.9199661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9200181Z def test_silu_mul_quant( 2025-05-07T20:32:24.9200437Z self, 2025-05-07T20:32:24.9200644Z T: int, 2025-05-07T20:32:24.9200853Z D: int, 2025-05-07T20:32:24.9201079Z scale_ub: Optional[float], 2025-05-07T20:32:24.9201385Z contiguous: bool, 2025-05-07T20:32:24.9201649Z compiled: bool, 2025-05-07T20:32:24.9201883Z ) -> None: 2025-05-07T20:32:24.9202112Z torch.manual_seed(2025) 2025-05-07T20:32:24.9202381Z 2025-05-07T20:32:24.9202680Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9205273Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
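The CompilationError above is an architecture limitation rather than a bug in the kernel: Triton's fp8e4nv is the FP8 E4M3 format, which appears to require compute capability (8, 9) or newer, and the roughly 22 GiB GPU on this runner evidently reports something older. A hedged sketch of a capability guard; the helper name and skip message are illustrative, not part of the test file:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # FP8 E4M3 ("fp8e4nv" in Triton) is generally available from SM 8.9
        # (Ada/Hopper) onward; older parts trigger the ValueError seen above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Possible use on the affected tests:
    # @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")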
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9207632Z 2025-05-07T20:32:24.9207763Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.9208016Z 2025-05-07T20:32:24.9208126Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9208601Z self=, 2025-05-07T20:32:24.9209065Z T=128, 2025-05-07T20:32:24.9209272Z D=7168, 2025-05-07T20:32:24.9209484Z scale_ub=1200.0, 2025-05-07T20:32:24.9209721Z contiguous=True, 2025-05-07T20:32:24.9209991Z compiled=True, 2025-05-07T20:32:24.9210231Z ) 2025-05-07T20:32:24.9512983Z self = 2025-05-07T20:32:24.9513487Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:24.9513800Z 2025-05-07T20:32:24.9513910Z @given( 2025-05-07T20:32:24.9514244Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9514667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9515084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9515487Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9515821Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9516110Z ) 2025-05-07T20:32:24.9516471Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9517060Z def test_silu_mul_quant( 2025-05-07T20:32:24.9517308Z self, 2025-05-07T20:32:24.9517532Z T: int, 2025-05-07T20:32:24.9517734Z D: int, 2025-05-07T20:32:24.9517953Z scale_ub: Optional[float], 2025-05-07T20:32:24.9518218Z contiguous: bool, 2025-05-07T20:32:24.9518470Z compiled: bool, 2025-05-07T20:32:24.9518698Z ) -> None: 2025-05-07T20:32:24.9518926Z torch.manual_seed(2025) 2025-05-07T20:32:24.9519163Z 2025-05-07T20:32:24.9519442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9519785Z 2025-05-07T20:32:24.9519981Z x_sign = torch.sign(x) 2025-05-07T20:32:24.9520344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.9520727Z x = x_sign * x_clamp 2025-05-07T20:32:24.9520970Z x0 = x[:, :D] 2025-05-07T20:32:24.9521185Z x1 = x[:, D:] 2025-05-07T20:32:24.9521397Z 2025-05-07T20:32:24.9521648Z if contiguous: 2025-05-07T20:32:24.9521882Z x0 = x0.contiguous() 2025-05-07T20:32:24.9522151Z x1 = x1.contiguous() 2025-05-07T20:32:24.9522394Z 2025-05-07T20:32:24.9522586Z if scale_ub is not None: 2025-05-07T20:32:24.9522864Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.9523204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.9523507Z ) 2025-05-07T20:32:24.9523705Z else: 2025-05-07T20:32:24.9523919Z scale_ub_tensor = None 2025-05-07T20:32:24.9524166Z 2025-05-07T20:32:24.9524402Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.9524720Z op = silu_mul_quant 2025-05-07T20:32:24.9524972Z if compiled: 2025-05-07T20:32:24.9525225Z op = torch.compile(op) 2025-05-07T20:32:24.9525520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.9525795Z 2025-05-07T20:32:24.9525986Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.9526157Z 2025-05-07T20:32:24.9526261Z moe/activation_test.py:117: 2025-05-07T20:32:24.9526561Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.9526889Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.9527173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.9527736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:24.9528590Z return fn(*args, **kwargs) 
2025-05-07T20:32:24.9529248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.9529988Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.9530523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.9531200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.9531860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.9532399Z kernel = self.compile( 2025-05-07T20:32:24.9532940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.9533588Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.9533986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.9534213Z 2025-05-07T20:32:24.9534428Z self = 2025-05-07T20:32:24.9535501Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.9536978Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca290aac0>} 2025-05-07T20:32:24.9538328Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.9539353Z context = 2025-05-07T20:32:24.9539645Z 2025-05-07T20:32:24.9539834Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.9540380Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.9540945Z module_map=module_map) 2025-05-07T20:32:24.9541381Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.9541733Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.9542004Z E ^ 2025-05-07T20:32:24.9542534Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.9542983Z 2025-05-07T20:32:24.9543407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.9543916Z 2025-05-07T20:32:24.9544021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9544438Z self=, 2025-05-07T20:32:24.9544843Z T=128, 2025-05-07T20:32:24.9545041Z D=7168, 2025-05-07T20:32:24.9545240Z scale_ub=1200.0, 2025-05-07T20:32:24.9545469Z contiguous=True, 2025-05-07T20:32:24.9545699Z compiled=False, 2025-05-07T20:32:24.9545912Z ) 2025-05-07T20:32:24.9546237Z self = 2025-05-07T20:32:24.9546726Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.9546994Z 2025-05-07T20:32:24.9547077Z @given( 2025-05-07T20:32:24.9547315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9547630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9547932Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9548262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9548593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9548880Z ) 2025-05-07T20:32:24.9549339Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9549783Z def test_silu_mul_quant( 2025-05-07T20:32:24.9550027Z self, 2025-05-07T20:32:24.9550222Z T: int, 2025-05-07T20:32:24.9550427Z D: int, 2025-05-07T20:32:24.9550655Z scale_ub: Optional[float], 2025-05-07T20:32:24.9550923Z contiguous: bool, 2025-05-07T20:32:24.9551165Z compiled: bool, 2025-05-07T20:32:24.9551386Z ) -> None: 2025-05-07T20:32:24.9551602Z torch.manual_seed(2025) 2025-05-07T20:32:24.9551849Z 2025-05-07T20:32:24.9552126Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9552465Z 2025-05-07T20:32:24.9552665Z x_sign = torch.sign(x) 2025-05-07T20:32:24.9552961Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.9554955Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9556791Z 2025-05-07T20:32:24.9556920Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:24.9557193Z 2025-05-07T20:32:24.9557302Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9557714Z self=, 2025-05-07T20:32:24.9558116Z T=128, 2025-05-07T20:32:24.9558305Z D=5120, 2025-05-07T20:32:24.9558503Z scale_ub=1200.0, 2025-05-07T20:32:24.9558728Z contiguous=True, 2025-05-07T20:32:24.9558947Z compiled=True, 2025-05-07T20:32:24.9559154Z ) 2025-05-07T20:32:24.9559502Z self = 2025-05-07T20:32:24.9560020Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:24.9560285Z 2025-05-07T20:32:24.9560414Z @given( 2025-05-07T20:32:24.9560686Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9561001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9561305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9561682Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9562018Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9562298Z ) 2025-05-07T20:32:24.9562650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9563092Z def test_silu_mul_quant( 2025-05-07T20:32:24.9563331Z self, 2025-05-07T20:32:24.9563534Z T: int, 2025-05-07T20:32:24.9563737Z D: int, 2025-05-07T20:32:24.9563958Z scale_ub: Optional[float], 2025-05-07T20:32:24.9564226Z contiguous: bool, 2025-05-07T20:32:24.9564472Z compiled: bool, 2025-05-07T20:32:24.9564701Z ) -> None: 2025-05-07T20:32:24.9564917Z torch.manual_seed(2025) 2025-05-07T20:32:24.9565167Z 2025-05-07T20:32:24.9565449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9565789Z 2025-05-07T20:32:24.9565990Z x_sign = torch.sign(x) 2025-05-07T20:32:24.9566285Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.9568263Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
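Note the trend in these reports: the earlier failures had 26.44 MiB free and this one only 4.44 MiB, so the allocator is holding progressively more memory as examples run. The figures quoted in the message are also available programmatically; a small illustrative sketch:

    import torch

    # Mirrors the "allocated by PyTorch" / "reserved but unallocated" numbers
    # quoted in the OutOfMemoryError text.
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"allocated: {allocated:.2f} GiB, reserved-but-unallocated: {reserved - allocated:.2f} GiB")
    print(torch.cuda.memory_summary(abbreviated=True))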
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9570153Z 2025-05-07T20:32:24.9570275Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:24.9570498Z 2025-05-07T20:32:24.9570608Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9571017Z self=, 2025-05-07T20:32:24.9571420Z T=128, 2025-05-07T20:32:24.9571615Z D=7168, 2025-05-07T20:32:24.9571817Z scale_ub=None, 2025-05-07T20:32:24.9572036Z contiguous=True, 2025-05-07T20:32:24.9572258Z compiled=True, 2025-05-07T20:32:24.9572465Z ) 2025-05-07T20:32:25.1476849Z self = 2025-05-07T20:32:25.1477349Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.1477619Z 2025-05-07T20:32:25.1477700Z @given( 2025-05-07T20:32:25.1477936Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1478250Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1478552Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1478884Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1479233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1479512Z ) 2025-05-07T20:32:25.1479883Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1480366Z def test_silu_mul_quant( 2025-05-07T20:32:25.1480829Z self, 2025-05-07T20:32:25.1481032Z T: int, 2025-05-07T20:32:25.1481235Z D: int, 2025-05-07T20:32:25.1481458Z scale_ub: Optional[float], 2025-05-07T20:32:25.1481727Z contiguous: bool, 2025-05-07T20:32:25.1481972Z compiled: bool, 2025-05-07T20:32:25.1482202Z ) -> None: 2025-05-07T20:32:25.1482416Z torch.manual_seed(2025) 2025-05-07T20:32:25.1482660Z 2025-05-07T20:32:25.1482935Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1485040Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
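Since the free-memory figure keeps shrinking across generated examples, tensors from earlier examples are apparently still referenced or cached between runs. A minimal mitigation sketch; where to call it (for instance a setUp/tearDown hook) is an assumption, not something this log shows:

    import gc

    import torch

    def _release_cuda_memory() -> None:
        gc.collect()              # drop Python references to dead tensors
        torch.cuda.synchronize()  # wait for in-flight kernels to finish
        torch.cuda.empty_cache()  # hand cached blocks back to the driver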
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1487028Z 2025-05-07T20:32:25.1487149Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.1487372Z 2025-05-07T20:32:25.1496661Z FAILED 2025-05-07T20:32:25.1496791Z 2025-05-07T20:32:25.1496938Z =================================== FAILURES =================================== 2025-05-07T20:32:25.1497542Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:25.1498155Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:25.1499016Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:25.1499836Z | yield 2025-05-07T20:32:25.1500435Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:25.1501264Z | self._callTestMethod(testMethod) 2025-05-07T20:32:25.1502054Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:25.1502828Z | if method() is not None: 2025-05-07T20:32:25.1503178Z | ^^^^^^^^ 2025-05-07T20:32:25.1504066Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:25.1505066Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1505478Z | ^^^^^^^ 2025-05-07T20:32:25.1506256Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:25.1517103Z | raise the_error_hypothesis_found 2025-05-07T20:32:25.1517752Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:25.1518355Z +-+---------------- 1 ---------------- 2025-05-07T20:32:25.1518779Z | Traceback (most recent call last): 2025-05-07T20:32:25.1519788Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:25.1520908Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1521437Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1524219Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1527086Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.1527694Z | self=, 2025-05-07T20:32:25.1528500Z | T=2048, 2025-05-07T20:32:25.1528829Z | D=5120, # or any other generated value 2025-05-07T20:32:25.1529312Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:25.1529863Z | contiguous=True, # or any other generated value 2025-05-07T20:32:25.1530380Z | compiled=False, # or any other generated value 2025-05-07T20:32:25.1530808Z | ) 2025-05-07T20:32:25.1531050Z | 2025-05-07T20:32:25.1531795Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:25.1532860Z +---------------- 2 ---------------- 2025-05-07T20:32:25.1533259Z | Traceback (most recent call last): 2025-05-07T20:32:25.1534346Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:25.1535449Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1535985Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1538780Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1541624Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.1542227Z | self=, 2025-05-07T20:32:25.1542789Z | T=128, 2025-05-07T20:32:25.1543081Z | D=7168, 2025-05-07T20:32:25.1543298Z | scale_ub=None, 2025-05-07T20:32:25.1543549Z | contiguous=True, 2025-05-07T20:32:25.1543796Z | compiled=True, 2025-05-07T20:32:25.1544023Z | ) 2025-05-07T20:32:25.1544211Z | 2025-05-07T20:32:25.1544739Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:25.1545352Z +---------------- 3 ---------------- 2025-05-07T20:32:25.1545647Z | Traceback (most recent call last): 2025-05-07T20:32:25.1546362Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:25.1547143Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1547518Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1549584Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
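Each sub-failure above ends with a ready-made replay decorator. To rerun one falsifying example deterministically, stack @reproduce_failure above @given with the blob copied verbatim from the log; the version string must match the installed Hypothesis (6.131.14 here). A sketch using the blob from failure 1; the strategy and body are abbreviated:

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob from failure 1 above
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_replay(T: int) -> None:
        ...  # real test body goes here; Hypothesis replays exactly the recorded example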
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1551542Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.1551985Z | self=, 2025-05-07T20:32:25.1552395Z | T=128, 2025-05-07T20:32:25.1552599Z | D=5120, 2025-05-07T20:32:25.1552954Z | scale_ub=1200.0, 2025-05-07T20:32:25.1553225Z | contiguous=True, 2025-05-07T20:32:25.1553482Z | compiled=True, 2025-05-07T20:32:25.1553727Z | ) 2025-05-07T20:32:25.1553921Z | 2025-05-07T20:32:25.1554537Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:25.1555262Z +---------------- 4 ---------------- 2025-05-07T20:32:25.1555585Z | Traceback (most recent call last): 2025-05-07T20:32:25.1556437Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:25.1557358Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.1557717Z | ^^^^^^^^ 2025-05-07T20:32:25.1558519Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:25.1559354Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1559791Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1560757Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:25.1561729Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.1562448Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:25.1563329Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1563852Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1564616Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:25.1565544Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1566107Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1566908Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:25.1567883Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1568425Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1569186Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:25.1570018Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.1570497Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1571202Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:25.1571875Z | fn() 2025-05-07T20:32:25.1572543Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:25.1573300Z | self.fn.run( 2025-05-07T20:32:25.1573921Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:25.1574607Z | kernel = self.compile( 2025-05-07T20:32:25.1574894Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:25.1575604Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:25.1576451Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1576939Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1577702Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:25.1578654Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1579207Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1579635Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1580032Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.1580324Z | ^ 2025-05-07T20:32:25.1580901Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1581614Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.1582106Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:25.1582705Z | self=, 2025-05-07T20:32:25.1583208Z | T=1, # or any other generated value 2025-05-07T20:32:25.1583558Z | D=5120, # or any other generated value 2025-05-07T20:32:25.1583934Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:25.1584339Z | contiguous=True, # or any other generated value 2025-05-07T20:32:25.1584751Z | compiled=True, # or any other generated value 2025-05-07T20:32:25.1585089Z | ) 2025-05-07T20:32:25.1585275Z | 2025-05-07T20:32:25.1585896Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:25.1586625Z +------------------------------------ 2025-05-07T20:32:25.1587025Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:25.1587455Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1587937Z self=, 2025-05-07T20:32:25.1588502Z T=1, 2025-05-07T20:32:25.1588757Z D=5120, 2025-05-07T20:32:25.1589032Z scale_ub=None, 2025-05-07T20:32:25.1589442Z contiguous=True, 2025-05-07T20:32:25.1589756Z compiled=True, 2025-05-07T20:32:25.1590098Z ) 2025-05-07T20:32:25.1590547Z self = 2025-05-07T20:32:25.1591222Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.1591597Z 2025-05-07T20:32:25.1719740Z @given( 2025-05-07T20:32:25.1720116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1720587Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1720993Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1721435Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1721905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1722293Z ) 2025-05-07T20:32:25.1722772Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1723372Z def test_silu_mul_quant( 2025-05-07T20:32:25.1723698Z self, 2025-05-07T20:32:25.1723968Z T: int, 2025-05-07T20:32:25.1724237Z D: int, 2025-05-07T20:32:25.1724533Z scale_ub: Optional[float], 2025-05-07T20:32:25.1724910Z contiguous: 
bool, 2025-05-07T20:32:25.1725239Z compiled: bool, 2025-05-07T20:32:25.1725549Z ) -> None: 2025-05-07T20:32:25.1725843Z torch.manual_seed(2025) 2025-05-07T20:32:25.1726174Z 2025-05-07T20:32:25.1726544Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1727029Z 2025-05-07T20:32:25.1727302Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1727706Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1728400Z x = x_sign * x_clamp 2025-05-07T20:32:25.1729156Z x0 = x[:, :D] 2025-05-07T20:32:25.1729487Z x1 = x[:, D:] 2025-05-07T20:32:25.1729779Z 2025-05-07T20:32:25.1730033Z if contiguous: 2025-05-07T20:32:25.1730340Z x0 = x0.contiguous() 2025-05-07T20:32:25.1730674Z x1 = x1.contiguous() 2025-05-07T20:32:25.1730976Z 2025-05-07T20:32:25.1731227Z if scale_ub is not None: 2025-05-07T20:32:25.1731584Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1731915Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1732223Z ) 2025-05-07T20:32:25.1732417Z else: 2025-05-07T20:32:25.1732621Z scale_ub_tensor = None 2025-05-07T20:32:25.1733073Z 2025-05-07T20:32:25.1733519Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1733970Z op = silu_mul_quant 2025-05-07T20:32:25.1734330Z if compiled: 2025-05-07T20:32:25.1734679Z op = torch.compile(op) 2025-05-07T20:32:25.1735156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1735501Z 2025-05-07T20:32:25.1735742Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.1736133Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.1736438Z 2025-05-07T20:32:25.1736679Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1737010Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.1737297Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.1737607Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.1737967Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1738270Z 2025-05-07T20:32:25.1738479Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.1738676Z 2025-05-07T20:32:25.1738784Z moe/activation_test.py:126: 2025-05-07T20:32:25.1739077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1739420Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.1739798Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1740591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.1741351Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.1741899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1742576Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1743257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.1743972Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1744720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.1745469Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1746191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.1746826Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.1747421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.1747935Z fn() 2025-05-07T20:32:25.1748432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.1749013Z self.fn.run( 2025-05-07T20:32:25.1749619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1750169Z kernel = self.compile( 2025-05-07T20:32:25.1750763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1751419Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1751818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1752048Z 2025-05-07T20:32:25.1752254Z self = 2025-05-07T20:32:25.1753331Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1754712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd95239260>} 2025-05-07T20:32:25.1756165Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1757187Z context = 2025-05-07T20:32:25.1757472Z 2025-05-07T20:32:25.1757639Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1758161Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1758626Z module_map=module_map) 2025-05-07T20:32:25.1758986Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1759342Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.1759609Z E ^ 2025-05-07T20:32:25.1760078Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1760524Z 2025-05-07T20:32:25.1760940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1761454Z 2025-05-07T20:32:25.1761558Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1761967Z self=, 2025-05-07T20:32:25.1762368Z T=2048, 2025-05-07T20:32:25.1762553Z D=5120, 2025-05-07T20:32:25.1762747Z scale_ub=1200.0, 2025-05-07T20:32:25.1762969Z contiguous=True, 2025-05-07T20:32:25.1763188Z compiled=False, 2025-05-07T20:32:25.1763400Z ) 2025-05-07T20:32:25.1763721Z self = 2025-05-07T20:32:25.1764203Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.1764480Z 2025-05-07T20:32:25.1764557Z @given( 2025-05-07T20:32:25.1764795Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1765100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1765403Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1765735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1766058Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1766336Z ) 2025-05-07T20:32:25.1766683Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1767122Z def test_silu_mul_quant( 2025-05-07T20:32:25.1767356Z self, 2025-05-07T20:32:25.1767554Z T: int, 2025-05-07T20:32:25.1767752Z D: int, 2025-05-07T20:32:25.1767964Z scale_ub: Optional[float], 2025-05-07T20:32:25.1768232Z contiguous: bool, 2025-05-07T20:32:25.1768472Z compiled: bool, 2025-05-07T20:32:25.1768687Z ) -> None: 2025-05-07T20:32:25.1768906Z torch.manual_seed(2025) 2025-05-07T20:32:25.1769150Z 2025-05-07T20:32:25.1769420Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1769809Z 2025-05-07T20:32:25.1770009Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1770349Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1770662Z x = x_sign * x_clamp 2025-05-07T20:32:25.1770905Z x0 = x[:, :D] 2025-05-07T20:32:25.1771128Z x1 = x[:, D:] 2025-05-07T20:32:25.1771330Z 2025-05-07T20:32:25.1771517Z if contiguous: 2025-05-07T20:32:25.1771748Z x0 = x0.contiguous() 2025-05-07T20:32:25.1772000Z x1 = x1.contiguous() 2025-05-07T20:32:25.1772240Z 2025-05-07T20:32:25.1772431Z if scale_ub is not None: 2025-05-07T20:32:25.1772697Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1773032Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1773339Z ) 2025-05-07T20:32:25.1773578Z else: 2025-05-07T20:32:25.1773838Z scale_ub_tensor = None 2025-05-07T20:32:25.1774089Z 2025-05-07T20:32:25.1774317Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1774632Z op = silu_mul_quant 2025-05-07T20:32:25.1774921Z if compiled: 2025-05-07T20:32:25.1775163Z op = torch.compile(op) 2025-05-07T20:32:25.1775459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1775733Z 2025-05-07T20:32:25.1775927Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1776092Z 2025-05-07T20:32:25.1776193Z moe/activation_test.py:117: 2025-05-07T20:32:25.1776485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1776818Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1777093Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1777774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1778470Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.1778997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.1779678Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.1780355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.1780920Z kernel = self.compile(
2025-05-07T20:32:25.1781450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.1782099Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.1782495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.1782719Z
2025-05-07T20:32:25.1782931Z self =
2025-05-07T20:32:25.1784005Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.1785370Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd94ee4180>}
2025-05-07T20:32:25.1786712Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.1787731Z context =
2025-05-07T20:32:25.1788016Z
2025-05-07T20:32:25.1788186Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.1788694Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.1789247Z module_map=module_map)
2025-05-07T20:32:25.1789615Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.1789968Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.1790279Z E ^
2025-05-07T20:32:25.1790749Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.1791194Z
2025-05-07T20:32:25.1791620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
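This CompilationError is Triton's capability check rather than a bug in the kernel itself: fp8e4nv (FP8 E4M3) code generation requires an NVIDIA GPU with compute capability 8.9 or newer (Ada/Hopper), while older architectures only expose fp8e4b15 and fp8e5, exactly as the message says. A guard along the following lines (a sketch with a hypothetical helper and class name, not code from the FBGEMM test suite) would skip these cases on unsupported hardware instead of failing:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton only compiles fp8e4nv (E4M3) kernels on SM 8.9+ (Ada / Hopper).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage: gate the whole test class on the capability check.
@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires an SM 8.9+ GPU")
class SiluMulQuantTests(unittest.TestCase):
    ...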
2025-05-07T20:32:25.1792128Z
2025-05-07T20:32:25.1792231Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.1792643Z self=,
2025-05-07T20:32:25.1793045Z T=2048,
2025-05-07T20:32:25.1793232Z D=5120,
2025-05-07T20:32:25.1793429Z scale_ub=1200.0,
2025-05-07T20:32:25.1793696Z contiguous=True,
2025-05-07T20:32:25.1793978Z compiled=True,
2025-05-07T20:32:25.1794184Z )
2025-05-07T20:32:25.1794506Z self =
2025-05-07T20:32:25.1795026Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:25.1795304Z
2025-05-07T20:32:25.1795383Z @given(
2025-05-07T20:32:25.1795615Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:25.1795928Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:25.1796226Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:25.1796553Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:25.1796883Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:25.1797159Z )
2025-05-07T20:32:25.1797415Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:25.1797509Z def test_silu_mul_quant(
2025-05-07T20:32:25.1797594Z self,
2025-05-07T20:32:25.1797685Z T: int,
2025-05-07T20:32:25.1797761Z D: int,
2025-05-07T20:32:25.1797864Z scale_ub: Optional[float],
2025-05-07T20:32:25.1797953Z contiguous: bool,
2025-05-07T20:32:25.1798041Z compiled: bool,
2025-05-07T20:32:25.1798127Z ) -> None:
2025-05-07T20:32:25.1798228Z torch.manual_seed(2025)
2025-05-07T20:32:25.1798300Z
2025-05-07T20:32:25.1798474Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:25.1798549Z
2025-05-07T20:32:25.1798641Z x_sign = torch.sign(x)
2025-05-07T20:32:25.1798772Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:25.1798861Z x = x_sign * x_clamp
2025-05-07T20:32:25.1798941Z x0 = x[:, :D]
2025-05-07T20:32:25.1799031Z x1 = x[:, D:]
2025-05-07T20:32:25.1799103Z
2025-05-07T20:32:25.1799187Z if contiguous:
2025-05-07T20:32:25.1799284Z x0 = x0.contiguous()
2025-05-07T20:32:25.1799376Z x1 = x1.contiguous()
2025-05-07T20:32:25.1799456Z
2025-05-07T20:32:25.1799545Z if scale_ub is not None:
2025-05-07T20:32:25.1799650Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:25.1800042Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:25.1800132Z )
2025-05-07T20:32:25.1800223Z else:
2025-05-07T20:32:25.1800326Z scale_ub_tensor = None
2025-05-07T20:32:25.1800399Z
2025-05-07T20:32:25.1800531Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:25.1800627Z op = silu_mul_quant
2025-05-07T20:32:25.1800712Z if compiled:
2025-05-07T20:32:25.1800811Z op = torch.compile(op)
2025-05-07T20:32:25.1800920Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.1800992Z
2025-05-07T20:32:25.1801087Z y_fp8, y_scale = fn()
2025-05-07T20:32:25.1801208Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:25.1801282Z
2025-05-07T20:32:25.1801424Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:25.1801525Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:25.1801625Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:25.1801806Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:25.1801946Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:25.1802020Z
2025-05-07T20:32:25.1802126Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:25.1802131Z
2025-05-07T20:32:25.1802229Z moe/activation_test.py:126:
2025-05-07T20:32:25.1802364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.1802468Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:25.1802600Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:25.1803161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:25.1803341Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:25.1803700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.1803978Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.1804347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:25.1804609Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:25.1805004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:25.1805255Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:25.1805635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:25.1805807Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:25.1806150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:25.1806233Z fn()
2025-05-07T20:32:25.1806632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:25.1806723Z self.fn.run(
2025-05-07T20:32:25.1807058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.1807155Z kernel = self.compile(
2025-05-07T20:32:25.1807537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.1807710Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.1807845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.1807853Z
2025-05-07T20:32:25.1808062Z self =
2025-05-07T20:32:25.1808839Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.1809348Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8fbb74c0>}
2025-05-07T20:32:25.1810121Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.1810340Z context =
2025-05-07T20:32:25.1810344Z
2025-05-07T20:32:25.1810509Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.1810782Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.1810890Z module_map=module_map)
2025-05-07T20:32:25.1811102Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.1824563Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:25.1824668Z E ^
2025-05-07T20:32:25.1825044Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.1825050Z
2025-05-07T20:32:25.1825470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
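The ref_fn path above computes the SiLU-gated product in fp32 and then quantizes it row-wise to FP8. As a plain-PyTorch sketch of what triton_quantize_fp8_row is being asked to produce (assuming per-row scale = max(|row|) / FP8_MAX, optionally capped by scale_ub; the real Triton kernel in fp8_gemm.py may differ in details such as epsilon handling):

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub=None):
    # Per-row scale so the largest magnitude in each row maps to the FP8 max.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
    scale = row_max.clamp(min=1e-12) / FP8_MAX
    y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None], as the test does, then recovers y up to FP8 rounding.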
2025-05-07T20:32:25.1825474Z
2025-05-07T20:32:25.1825577Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.1825800Z self=,
2025-05-07T20:32:25.1825963Z T=16384,
2025-05-07T20:32:25.1826084Z D=7168,
2025-05-07T20:32:25.1826169Z scale_ub=1200.0,
2025-05-07T20:32:25.1826253Z contiguous=False,
2025-05-07T20:32:25.1826336Z compiled=False,
2025-05-07T20:32:25.1826406Z )
2025-05-07T20:32:25.1837834Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.1837934Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.1838015Z E ^
2025-05-07T20:32:25.1838366Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.1838789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
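Each "Trying example" block is one Hypothesis draw from the sampled_from grids in the @given decorator, and every draw fails with the same fp8e4nv CompilationError, differing only in whether it surfaces in _fbgemm_silu_mul_quant or, on the reference path, _kernel_quantize_fp8_row. To replay a single draw deterministically while debugging, Hypothesis's @example decorator can pin a case (a hypothetical snippet mirroring the decorators printed above; max_examples=10 stands in for the suite's _MAX_SAMPLES):

from hypothesis import Verbosity, example, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
@settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
    ...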
2025-05-07T20:32:25.1831345Z op = torch.compile(op) 2025-05-07T20:32:25.1831449Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1831519Z 2025-05-07T20:32:25.1831609Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1831616Z 2025-05-07T20:32:25.1831714Z moe/activation_test.py:117: 2025-05-07T20:32:25.1831839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1831943Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1832192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1832695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1832794Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.1833147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1833366Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1833705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1833796Z kernel = self.compile( 2025-05-07T20:32:25.1834255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1834487Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1834682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1834687Z 2025-05-07T20:32:25.1834900Z self = 2025-05-07T20:32:25.1835678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1836188Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8fe971a0>} 2025-05-07T20:32:25.1836930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1837130Z context = 2025-05-07T20:32:25.1837135Z 2025-05-07T20:32:25.1837306Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1837562Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1837673Z module_map=module_map) 2025-05-07T20:32:25.1837834Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1837934Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1838015Z E ^ 2025-05-07T20:32:25.1838366Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1838371Z 2025-05-07T20:32:25.1838789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1838796Z 2025-05-07T20:32:25.1838901Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1839122Z self=, 2025-05-07T20:32:25.1839206Z T=1, 2025-05-07T20:32:25.1839282Z D=7168, 2025-05-07T20:32:25.1839362Z scale_ub=None, 2025-05-07T20:32:25.1839450Z contiguous=True, 2025-05-07T20:32:25.1839531Z compiled=True, 2025-05-07T20:32:25.1839606Z ) 2025-05-07T20:32:25.1839826Z self = 2025-05-07T20:32:25.1839985Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.1839989Z 2025-05-07T20:32:25.1840077Z @given( 2025-05-07T20:32:25.1840202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1840302Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1840453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1840584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1840707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1840786Z ) 2025-05-07T20:32:25.1841077Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1841173Z def test_silu_mul_quant( 2025-05-07T20:32:25.1841250Z self, 2025-05-07T20:32:25.1841324Z T: int, 2025-05-07T20:32:25.1841404Z D: int, 2025-05-07T20:32:25.1841503Z scale_ub: Optional[float], 2025-05-07T20:32:25.1841589Z contiguous: bool, 2025-05-07T20:32:25.1841678Z compiled: bool, 2025-05-07T20:32:25.1841758Z ) -> None: 2025-05-07T20:32:25.1841853Z torch.manual_seed(2025) 2025-05-07T20:32:25.1841928Z 2025-05-07T20:32:25.1842098Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1842170Z 2025-05-07T20:32:25.1842330Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1842496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1842583Z x = x_sign * x_clamp 2025-05-07T20:32:25.1842663Z x0 = x[:, :D] 2025-05-07T20:32:25.1842777Z x1 = x[:, D:] 2025-05-07T20:32:25.1842851Z 2025-05-07T20:32:25.1842944Z if contiguous: 2025-05-07T20:32:25.1843034Z x0 = x0.contiguous() 2025-05-07T20:32:25.1843124Z x1 = x1.contiguous() 2025-05-07T20:32:25.1843195Z 2025-05-07T20:32:25.1843283Z if scale_ub is not None: 2025-05-07T20:32:25.1843394Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1843525Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1843602Z ) 2025-05-07T20:32:25.1843678Z else: 2025-05-07T20:32:25.1843769Z scale_ub_tensor = None 2025-05-07T20:32:25.1843838Z 2025-05-07T20:32:25.1843966Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1844057Z op = silu_mul_quant 2025-05-07T20:32:25.1844142Z if compiled: 2025-05-07T20:32:25.1844240Z op = torch.compile(op) 2025-05-07T20:32:25.1844342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1844419Z 2025-05-07T20:32:25.1844509Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.1844629Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.1844702Z 2025-05-07T20:32:25.1844833Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1844932Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.1845032Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.1845150Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.1845287Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1845359Z 2025-05-07T20:32:25.1845455Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:25.1845462Z 2025-05-07T20:32:25.1845562Z moe/activation_test.py:126: 2025-05-07T20:32:25.1845694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1845798Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.1845936Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1846498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.1846597Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.1847616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1847837Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1848210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.1848465Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1848866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.1849175Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1849568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.1849763Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.1850112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.1850190Z fn() 2025-05-07T20:32:25.1850595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.1850680Z self.fn.run( 2025-05-07T20:32:25.1851015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1851192Z kernel = self.compile( 2025-05-07T20:32:25.1851570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1851791Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1851922Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1851926Z 2025-05-07T20:32:25.1852129Z self = 2025-05-07T20:32:25.1852909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1853411Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd8fc128e0>} 2025-05-07T20:32:25.1854168Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1854362Z context = 2025-05-07T20:32:25.1854366Z 2025-05-07T20:32:25.1854538Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1854799Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1854910Z module_map=module_map) 2025-05-07T20:32:25.1855079Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1855186Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.1855268Z E ^ 2025-05-07T20:32:25.1855629Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1855639Z 2025-05-07T20:32:25.1856052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1856057Z 2025-05-07T20:32:25.1856173Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1856401Z self=, 2025-05-07T20:32:25.1856480Z T=4096, 2025-05-07T20:32:25.1856567Z D=5120, 2025-05-07T20:32:25.1856654Z scale_ub=None, 2025-05-07T20:32:25.1856742Z contiguous=False, 2025-05-07T20:32:25.1856838Z compiled=False, 2025-05-07T20:32:25.1856913Z ) 2025-05-07T20:32:25.1857131Z self = 2025-05-07T20:32:25.1857316Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.1857320Z 2025-05-07T20:32:25.1857399Z @given( 2025-05-07T20:32:25.1857526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1857633Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1857749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1857874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1858035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1858112Z ) 2025-05-07T20:32:25.1858368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1858463Z def test_silu_mul_quant( 2025-05-07T20:32:25.1858551Z self, 2025-05-07T20:32:25.1858629Z T: int, 2025-05-07T20:32:25.1858707Z D: int, 2025-05-07T20:32:25.1858813Z scale_ub: Optional[float], 2025-05-07T20:32:25.1858904Z contiguous: bool, 2025-05-07T20:32:25.1858992Z compiled: bool, 2025-05-07T20:32:25.1859082Z ) -> None: 2025-05-07T20:32:25.1859178Z torch.manual_seed(2025) 2025-05-07T20:32:25.1859252Z 2025-05-07T20:32:25.1859472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1859592Z 2025-05-07T20:32:25.1859686Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1859820Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1859949Z x = x_sign * x_clamp 2025-05-07T20:32:25.1860050Z x0 = x[:, :D] 2025-05-07T20:32:25.1860154Z x1 = x[:, D:] 2025-05-07T20:32:25.1860241Z 2025-05-07T20:32:25.1860347Z if contiguous: 2025-05-07T20:32:25.1860441Z x0 = x0.contiguous() 2025-05-07T20:32:25.1860531Z x1 = x1.contiguous() 2025-05-07T20:32:25.1860609Z 2025-05-07T20:32:25.1860701Z if scale_ub is not None: 2025-05-07T20:32:25.1860807Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1860951Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1861024Z ) 2025-05-07T20:32:25.1861109Z else: 2025-05-07T20:32:25.1861202Z scale_ub_tensor = None 2025-05-07T20:32:25.1861278Z 2025-05-07T20:32:25.1861416Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1861506Z op = silu_mul_quant 2025-05-07T20:32:25.1861595Z if compiled: 
2025-05-07T20:32:25.1861704Z op = torch.compile(op) 2025-05-07T20:32:25.1861814Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1861887Z 2025-05-07T20:32:25.1861987Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1861991Z 2025-05-07T20:32:25.1862087Z moe/activation_test.py:117: 2025-05-07T20:32:25.1862227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1862328Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1862427Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1862931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1863027Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.1863387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1863613Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1863954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1864057Z kernel = self.compile( 2025-05-07T20:32:25.1864437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1864611Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1864747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1864752Z 2025-05-07T20:32:25.1864955Z self = 2025-05-07T20:32:25.1865735Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1866284Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8f59c680>} 2025-05-07T20:32:25.1867032Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1867228Z context = 2025-05-07T20:32:25.1867232Z 2025-05-07T20:32:25.1867393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1867657Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1867806Z module_map=module_map) 2025-05-07T20:32:25.1868006Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1868112Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1868190Z E ^ 2025-05-07T20:32:25.1868583Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1868594Z 2025-05-07T20:32:25.1869007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1869012Z 2025-05-07T20:32:25.1869173Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1869400Z self=, 2025-05-07T20:32:25.1869477Z T=4096, 2025-05-07T20:32:25.1869553Z D=7168, 2025-05-07T20:32:25.1869638Z scale_ub=None, 2025-05-07T20:32:25.1869726Z contiguous=False, 2025-05-07T20:32:25.1869813Z compiled=False, 2025-05-07T20:32:25.1869895Z ) 2025-05-07T20:32:25.1870114Z self = 2025-05-07T20:32:25.1870296Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.1870300Z 2025-05-07T20:32:25.1870384Z @given( 2025-05-07T20:32:25.1870530Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1870650Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1870773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1870888Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1871005Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1871080Z ) 2025-05-07T20:32:25.1871329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1871423Z def test_silu_mul_quant( 2025-05-07T20:32:25.1871499Z self, 2025-05-07T20:32:25.1871581Z T: int, 2025-05-07T20:32:25.1871656Z D: int, 2025-05-07T20:32:25.1871758Z scale_ub: Optional[float], 2025-05-07T20:32:25.1871860Z contiguous: bool, 2025-05-07T20:32:25.1871951Z compiled: bool, 2025-05-07T20:32:25.1872030Z ) -> None: 2025-05-07T20:32:25.1872129Z torch.manual_seed(2025) 2025-05-07T20:32:25.1872204Z 2025-05-07T20:32:25.1872374Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1872453Z 2025-05-07T20:32:25.1872544Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1872668Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1872764Z x = x_sign * x_clamp 2025-05-07T20:32:25.1872850Z x0 = x[:, :D] 2025-05-07T20:32:25.1872936Z x1 = x[:, D:] 2025-05-07T20:32:25.1873008Z 2025-05-07T20:32:25.1873093Z if contiguous: 2025-05-07T20:32:25.1873189Z x0 = x0.contiguous() 2025-05-07T20:32:25.1873277Z x1 = x1.contiguous() 2025-05-07T20:32:25.1873350Z 2025-05-07T20:32:25.1873447Z if scale_ub is not None: 2025-05-07T20:32:25.1873558Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1873691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1873777Z ) 2025-05-07T20:32:25.1873855Z else: 2025-05-07T20:32:25.1874034Z scale_ub_tensor = None 2025-05-07T20:32:25.1874113Z 2025-05-07T20:32:25.1874243Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1874342Z op = silu_mul_quant 2025-05-07T20:32:25.1874431Z if compiled: 2025-05-07T20:32:25.1874530Z op = torch.compile(op) 2025-05-07T20:32:25.1874642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1874716Z 2025-05-07T20:32:25.1874806Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1874810Z 2025-05-07T20:32:25.1874914Z moe/activation_test.py:117: 2025-05-07T20:32:25.1875042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1875185Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1875330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1875866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1875974Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.1876331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1876553Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1876896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1876990Z kernel = self.compile( 2025-05-07T20:32:25.1877371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1877551Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1877684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1877689Z 2025-05-07T20:32:25.1877899Z self = 2025-05-07T20:32:25.1878674Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1879187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8feec180>} 2025-05-07T20:32:25.1879927Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1880150Z context = 2025-05-07T20:32:25.1880159Z 2025-05-07T20:32:25.1880348Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1880606Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1880721Z module_map=module_map) 2025-05-07T20:32:25.1880884Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1880985Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1881070Z E ^ 2025-05-07T20:32:25.1881424Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1881429Z 2025-05-07T20:32:25.1881843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1881853Z 2025-05-07T20:32:25.1881961Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1882185Z self=, 2025-05-07T20:32:25.1882271Z T=128, 2025-05-07T20:32:25.1882349Z D=7168, 2025-05-07T20:32:25.1882430Z scale_ub=None, 2025-05-07T20:32:25.1882521Z contiguous=False, 2025-05-07T20:32:25.1882649Z compiled=True, 2025-05-07T20:32:25.1882722Z ) 2025-05-07T20:32:25.1882944Z self = 2025-05-07T20:32:25.1883113Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:25.1883118Z 2025-05-07T20:32:25.1883200Z @given( 2025-05-07T20:32:25.1883317Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1883417Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1883535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1883655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1883766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1883886Z ) 2025-05-07T20:32:25.1884169Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1884275Z def test_silu_mul_quant( 2025-05-07T20:32:25.1884352Z self, 2025-05-07T20:32:25.1884472Z T: int, 2025-05-07T20:32:25.1884559Z D: int, 2025-05-07T20:32:25.1884661Z scale_ub: Optional[float], 2025-05-07T20:32:25.1884752Z contiguous: bool, 2025-05-07T20:32:25.1884846Z compiled: bool, 2025-05-07T20:32:25.1884925Z ) -> None: 2025-05-07T20:32:25.1885024Z torch.manual_seed(2025) 2025-05-07T20:32:25.1885098Z 2025-05-07T20:32:25.1885265Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1885345Z 2025-05-07T20:32:25.1885441Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1885566Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1885662Z x = x_sign * x_clamp 2025-05-07T20:32:25.1885745Z x0 = x[:, :D] 2025-05-07T20:32:25.1885828Z x1 = x[:, D:] 2025-05-07T20:32:25.1885907Z 2025-05-07T20:32:25.1885991Z if contiguous: 2025-05-07T20:32:25.1886084Z x0 = x0.contiguous() 2025-05-07T20:32:25.1886179Z x1 = x1.contiguous() 2025-05-07T20:32:25.1886255Z 2025-05-07T20:32:25.1886351Z if scale_ub is not None: 2025-05-07T20:32:25.1886464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1886598Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1886680Z ) 2025-05-07T20:32:25.1886760Z else: 2025-05-07T20:32:25.1886856Z scale_ub_tensor = None 2025-05-07T20:32:25.1886935Z 2025-05-07T20:32:25.1887061Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1887151Z op = silu_mul_quant 2025-05-07T20:32:25.1887243Z if compiled: 2025-05-07T20:32:25.1887343Z op = torch.compile(op) 2025-05-07T20:32:25.1887452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1887533Z 2025-05-07T20:32:25.1887624Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.1887742Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.1887821Z 2025-05-07T20:32:25.1887961Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1888067Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.1888166Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.1888287Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.1888430Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1888503Z 2025-05-07T20:32:25.1888602Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:25.1888607Z 2025-05-07T20:32:25.1888710Z moe/activation_test.py:126: 2025-05-07T20:32:25.1888841Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1888952Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.1889088Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1889647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.1889809Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.1890204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1890445Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1890816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.1891067Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1891466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.1891758Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1892170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.1892381Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.1892722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.1892806Z fn() 2025-05-07T20:32:25.1893206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.1893290Z self.fn.run( 2025-05-07T20:32:25.1893630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1893722Z kernel = self.compile( 2025-05-07T20:32:25.1894099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1894287Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1894414Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1894418Z 2025-05-07T20:32:25.1894633Z self = 2025-05-07T20:32:25.1895406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1896128Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd8f5c7100>} 2025-05-07T20:32:25.1896874Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1897075Z context = 2025-05-07T20:32:25.1897080Z 2025-05-07T20:32:25.1897251Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1897513Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1897619Z module_map=module_map) 2025-05-07T20:32:25.1897789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1897890Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.1897971Z E ^ 2025-05-07T20:32:25.1898324Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1898328Z 2025-05-07T20:32:25.1898739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1898747Z 2025-05-07T20:32:25.1898857Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1899076Z self=, 2025-05-07T20:32:25.1899162Z T=128, 2025-05-07T20:32:25.1899242Z D=7168, 2025-05-07T20:32:25.1899375Z scale_ub=None, 2025-05-07T20:32:25.1899471Z contiguous=False, 2025-05-07T20:32:25.1899554Z compiled=False, 2025-05-07T20:32:25.1899626Z ) 2025-05-07T20:32:25.1899851Z self = 2025-05-07T20:32:25.1900021Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.1900025Z 2025-05-07T20:32:25.1900102Z @given( 2025-05-07T20:32:25.1900229Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1900328Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1900473Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1900655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1900807Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1900888Z ) 2025-05-07T20:32:25.1901131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1901290Z def test_silu_mul_quant( 2025-05-07T20:32:25.1901373Z self, 2025-05-07T20:32:25.1901452Z T: int, 2025-05-07T20:32:25.1901528Z D: int, 2025-05-07T20:32:25.1901633Z scale_ub: Optional[float], 2025-05-07T20:32:25.1901723Z contiguous: bool, 2025-05-07T20:32:25.1901808Z compiled: bool, 2025-05-07T20:32:25.1901896Z ) -> None: 2025-05-07T20:32:25.1901992Z torch.manual_seed(2025) 2025-05-07T20:32:25.1902070Z 2025-05-07T20:32:25.1902242Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1902317Z 2025-05-07T20:32:25.1902416Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1902540Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1902634Z x = x_sign * x_clamp 2025-05-07T20:32:25.1902722Z x0 = x[:, :D] 2025-05-07T20:32:25.1902802Z x1 = x[:, D:] 2025-05-07T20:32:25.1902878Z 2025-05-07T20:32:25.1902969Z if contiguous: 2025-05-07T20:32:25.1903071Z x0 = x0.contiguous() 2025-05-07T20:32:25.1903162Z x1 = x1.contiguous() 2025-05-07T20:32:25.1903243Z 2025-05-07T20:32:25.1903333Z if scale_ub is not None: 2025-05-07T20:32:25.1903438Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1903578Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1903654Z ) 2025-05-07T20:32:25.1903736Z else: 2025-05-07T20:32:25.1903833Z scale_ub_tensor = None 2025-05-07T20:32:25.1903905Z 2025-05-07T20:32:25.1904041Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1904131Z op = silu_mul_quant 2025-05-07T20:32:25.1904219Z if compiled: 
2025-05-07T20:32:25.1904330Z op = torch.compile(op) 2025-05-07T20:32:25.1904435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1904508Z 2025-05-07T20:32:25.1904605Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1904612Z 2025-05-07T20:32:25.1904709Z moe/activation_test.py:117: 2025-05-07T20:32:25.1904843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1904944Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1905042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1905542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1905638Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.1905994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1906222Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1906564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1906667Z kernel = self.compile( 2025-05-07T20:32:25.1907100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1907275Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1907408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1907413Z 2025-05-07T20:32:25.1907615Z self = 2025-05-07T20:32:25.1908392Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1908929Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8f08ce00>} 2025-05-07T20:32:25.1909890Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1910091Z context = 2025-05-07T20:32:25.1910095Z 2025-05-07T20:32:25.1910258Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1910524Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1910630Z module_map=module_map) 2025-05-07T20:32:25.1910792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1910896Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1910977Z E ^ 2025-05-07T20:32:25.1911332Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1911342Z 2025-05-07T20:32:25.1911758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1911762Z 2025-05-07T20:32:25.1911865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1912094Z self=, 2025-05-07T20:32:25.1912170Z T=4096, 2025-05-07T20:32:25.1912247Z D=5120, 2025-05-07T20:32:25.1912336Z scale_ub=1200.0, 2025-05-07T20:32:25.1912422Z contiguous=True, 2025-05-07T20:32:25.1912506Z compiled=False, 2025-05-07T20:32:25.1912584Z ) 2025-05-07T20:32:25.1912800Z self = 2025-05-07T20:32:25.1912982Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.1912988Z 2025-05-07T20:32:25.1913068Z @given( 2025-05-07T20:32:25.1913188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1913293Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1913408Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1913526Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1913643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1913718Z ) 2025-05-07T20:32:25.1913964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1914063Z def test_silu_mul_quant( 2025-05-07T20:32:25.1914140Z self, 2025-05-07T20:32:25.1914224Z T: int, 2025-05-07T20:32:25.1914302Z D: int, 2025-05-07T20:32:25.1914399Z scale_ub: Optional[float], 2025-05-07T20:32:25.1914494Z contiguous: bool, 2025-05-07T20:32:25.1914580Z compiled: bool, 2025-05-07T20:32:25.1914660Z ) -> None: 2025-05-07T20:32:25.1914763Z torch.manual_seed(2025) 2025-05-07T20:32:25.1914839Z 2025-05-07T20:32:25.1915005Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1915087Z 2025-05-07T20:32:25.1915182Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1915353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1915453Z x = x_sign * x_clamp 2025-05-07T20:32:25.1915534Z x0 = x[:, :D] 2025-05-07T20:32:25.1915620Z x1 = x[:, D:] 2025-05-07T20:32:25.1915693Z 2025-05-07T20:32:25.1915780Z if contiguous: 2025-05-07T20:32:25.1915879Z x0 = x0.contiguous() 2025-05-07T20:32:25.1915968Z x1 = x1.contiguous() 2025-05-07T20:32:25.1916042Z 2025-05-07T20:32:25.1916137Z if scale_ub is not None: 2025-05-07T20:32:25.1916242Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1916376Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1916505Z ) 2025-05-07T20:32:25.1916621Z else: 2025-05-07T20:32:25.1916715Z scale_ub_tensor = None 2025-05-07T20:32:25.1916793Z 2025-05-07T20:32:25.1916921Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1917058Z op = silu_mul_quant 2025-05-07T20:32:25.1917149Z if compiled: 2025-05-07T20:32:25.1917249Z op = torch.compile(op) 2025-05-07T20:32:25.1917361Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1917435Z 2025-05-07T20:32:25.1917526Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1917530Z 2025-05-07T20:32:25.1917634Z moe/activation_test.py:117: 2025-05-07T20:32:25.1917762Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1917863Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1917969Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1918463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1918574Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.1918930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1919154Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1919525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1919628Z kernel = self.compile( 2025-05-07T20:32:25.1920024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1920201Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1920327Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1920331Z 2025-05-07T20:32:25.1920539Z self = 2025-05-07T20:32:25.1921322Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1921833Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8f08df80>} 2025-05-07T20:32:25.1922576Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1922767Z context = 2025-05-07T20:32:25.1922772Z 2025-05-07T20:32:25.1922939Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1923198Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1923312Z module_map=module_map) 2025-05-07T20:32:25.1923472Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1923619Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1923706Z E ^ 2025-05-07T20:32:25.1924061Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1924065Z 2025-05-07T20:32:25.1924476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1924487Z 2025-05-07T20:32:25.1924590Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1924810Z self=, 2025-05-07T20:32:25.1924894Z T=1, 2025-05-07T20:32:25.1924971Z D=5120, 2025-05-07T20:32:25.1925093Z scale_ub=None, 2025-05-07T20:32:25.1925222Z contiguous=True, 2025-05-07T20:32:25.1925304Z compiled=True, 2025-05-07T20:32:25.1925377Z ) 2025-05-07T20:32:25.1925598Z self = 2025-05-07T20:32:25.1925797Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.1925802Z 2025-05-07T20:32:25.1925880Z @given( 2025-05-07T20:32:25.1926006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1926104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1926222Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1926339Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1926451Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1926530Z ) 2025-05-07T20:32:25.1926774Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1926866Z def test_silu_mul_quant( 2025-05-07T20:32:25.1926952Z self, 2025-05-07T20:32:25.1927032Z T: int, 2025-05-07T20:32:25.1927110Z D: int, 2025-05-07T20:32:25.1927214Z scale_ub: Optional[float], 2025-05-07T20:32:25.1927304Z contiguous: bool, 2025-05-07T20:32:25.1927398Z compiled: bool, 2025-05-07T20:32:25.1927478Z ) -> None: 2025-05-07T20:32:25.1927573Z torch.manual_seed(2025) 2025-05-07T20:32:25.1927654Z 2025-05-07T20:32:25.1927822Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1927897Z 2025-05-07T20:32:25.1927997Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1928309Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1928443Z x = x_sign * x_clamp 2025-05-07T20:32:25.1928565Z x0 = x[:, :D] 2025-05-07T20:32:25.1928678Z x1 = x[:, D:] 2025-05-07T20:32:25.1928778Z 2025-05-07T20:32:25.1928869Z if contiguous: 2025-05-07T20:32:25.1928961Z x0 = x0.contiguous() 2025-05-07T20:32:25.1929057Z x1 = x1.contiguous() 2025-05-07T20:32:25.1929142Z 2025-05-07T20:32:25.1929232Z if scale_ub is not None: 2025-05-07T20:32:25.1929344Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1929481Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1929559Z ) 2025-05-07T20:32:25.1929642Z else: 2025-05-07T20:32:25.1929737Z scale_ub_tensor = None 2025-05-07T20:32:25.1929808Z 2025-05-07T20:32:25.1929944Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1930033Z op = silu_mul_quant 2025-05-07T20:32:25.1930119Z if compiled: 2025-05-07T20:32:25.1930230Z op = torch.compile(op) 2025-05-07T20:32:25.1930336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1930411Z 2025-05-07T20:32:25.1930509Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.1930631Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.1930727Z 2025-05-07T20:32:25.1930882Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1931002Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.1931108Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.1931382Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.1931526Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1931607Z 2025-05-07T20:32:25.1931708Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:25.1931713Z 2025-05-07T20:32:25.1931811Z moe/activation_test.py:126: 2025-05-07T20:32:25.1931947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1932054Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.1932193Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1932753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.1933008Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.1933374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1933657Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1934029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.1934282Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1934678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.1934935Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1935308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.1935479Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.1935823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.1935904Z fn() 2025-05-07T20:32:25.1936309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.1936395Z self.fn.run( 2025-05-07T20:32:25.1936729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1936829Z kernel = self.compile( 2025-05-07T20:32:25.1937206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1937379Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1937516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1937526Z 2025-05-07T20:32:25.1937730Z self = 2025-05-07T20:32:25.1938513Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1939011Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd8f08e340>} 2025-05-07T20:32:25.1939758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1939950Z context = 2025-05-07T20:32:25.1939955Z 2025-05-07T20:32:25.1940120Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1940388Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1940498Z module_map=module_map) 2025-05-07T20:32:25.1940715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1940818Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.1940896Z E ^ 2025-05-07T20:32:25.1941255Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1941259Z 2025-05-07T20:32:25.1941669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1941673Z
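At this point the root cause is already clear, and it is environmental rather than a bug in the test logic: Triton's fp8e4nv type (the NVIDIA float8 e4m3 variant, evidently the target dtype of _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant here) is only available on GPUs of compute capability 8.9 or newer, and on older architectures Triton offers only the fp8e4b15 and fp8e5 formats named in the ValueError. A capability guard along the following lines would let the suite skip rather than fail on such hardware. This is a minimal sketch, assuming a unittest-style test class; the class name and skip message are illustrative, not FBGEMM's actual skip logic:

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (float8 e4m3) is only available on NVIDIA GPUs
    # with compute capability >= 8.9 (Ada/Hopper); anything older
    # raises the ValueError seen in the log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


class SiluMulQuantTests(unittest.TestCase):  # hypothetical class name
    @unittest.skipIf(
        not gpu_supports_fp8e4nv(),
        "FP8 e4m3 (fp8e4nv) unsupported on this GPU architecture",
    )
    def test_silu_mul_quant(self) -> None:
        ...  # test body as in the log

The same predicate could instead gate the dtype choice (falling back to fp8e5/e5m2 where e4m3 is unavailable), but a skip is the smaller change and would turn the repeated CompilationErrors below into a single skipped test.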
[The next eight Hypothesis examples fail identically. (T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True), (T=128, D=5120, None, True, True), (T=4096, D=5120, None, True, True), (T=16384, D=5120, None, True, True), and (T=1, D=5120, None, contiguous=False, True) hit the error while compiling _kernel_quantize_fp8_row (via ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370); (T=1, D=5120, scale_ub=1200.0, True, True), (T=1, D=5120, None, True, compiled=False), and (T=128, D=5120, None, contiguous=False, True) hit it while compiling _fbgemm_silu_mul_quant (via silu_mul_quant, moe/activation.py:80). In every case the test body and traceback match the example above, ending in the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").]
2025-05-07T20:32:25.2073054Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2073275Z self=, 2025-05-07T20:32:25.2073354Z T=128, 2025-05-07T20:32:25.2073477Z D=7168, 2025-05-07T20:32:25.2073599Z scale_ub=1200.0, 2025-05-07T20:32:25.2073692Z contiguous=False, 2025-05-07T20:32:25.2073777Z compiled=False, 2025-05-07T20:32:25.2073850Z ) [test body as above] 2025-05-07T20:32:25.2089762Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2089770Z 2025-05-07T20:32:25.2089882Z moe/activation_test.py:117: 2025-05-07T20:32:25.2090017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2090119Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2090230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2090850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2090949Z
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2091318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2091542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2091893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2091989Z kernel = self.compile( 2025-05-07T20:32:25.2092425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2092654Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2092826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2092834Z 2025-05-07T20:32:25.2093049Z self = 2025-05-07T20:32:25.2093830Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2094336Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8ea9c360>} 2025-05-07T20:32:25.2095088Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2095287Z context = 2025-05-07T20:32:25.2095293Z 2025-05-07T20:32:25.2095474Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2095736Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2095844Z module_map=module_map) 2025-05-07T20:32:25.2096019Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2096121Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2096208Z E ^ 2025-05-07T20:32:25.2096564Z E ValueError("type fp8e4nv not supported in this architecture. 
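Note: the failure above is the template for every example Hypothesis tries in this job. Triton rejects fp8e4nv (its name for the float8 e4m3 variant) at kernel-compile time because, per the error message, this runner's GPU only supports the fp8e4b15 and fp8e5 encodings. A minimal skip-guard sketch follows; it is hypothetical (not part of moe/activation_test.py) and assumes fp8e4nv codegen requires CUDA compute capability (8, 9) or newer:

# Hypothetical guard, not taken from the test file: skip fp8e4nv tests on
# GPUs where Triton cannot emit that dtype. The (8, 9) threshold (Ada/Hopper)
# is an assumption, not something stated in this log.
import unittest
import torch

def gpu_supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class SiluMulQuantTest(unittest.TestCase):  # hypothetical class name
    ...

With a guard like this, runs on this hardware would report skips instead of the repeated CompilationError traces below.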

2025-05-07T20:32:25.2097106Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False): fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid] raises the same CompilationError ("type fp8e4nv not supported in this architecture"); traceback identical to the one above.
2025-05-07T20:32:25.2110297Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): same CompilationError, identical traceback.
2025-05-07T20:32:25.2123111Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 (_fn) before reaching silu_mul_quant.
2025-05-07T20:32:25.2137804Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError, identical traceback.

2025-05-07T20:32:25.2151106Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [test source identical to the first listing above, through the definition of fn()]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7efca3aa1440>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
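This example is the odd one out in the run: the fused fn() call returned, and the failure shifted to the reference path, ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, inside the Triton autotuner. The root cause is still the fp8e4nv cast. For orientation, here is a pure-PyTorch mirror of what ref_fn computes; it is hypothetical, and the exact scaling and clamping details of triton_quantize_fp8_row are assumed rather than taken from its source:

# Hypothetical pure-PyTorch mirror of ref_fn: y = silu(x0) * x1 in fp32,
# then row-wise quantization to fp8 e4m3. 448.0 is the e4m3fn finite max;
# the epsilon and scale_ub handling below are assumptions.
import torch

FP8_E4M3_MAX = 448.0

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # SiLU(x0) * x1 computed in float32, as in the test's ref_fn.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        # scale_ub arrives as a 1-element float32 tensor in the test.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y / scale[:, None]).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    return y_fp8.to(torch.float8_e4m3fn), scale

The returned scale is the per-row dequant multiplier, matching how the test reconstructs y as y_fp8.to(torch.float32) * y_scale[:, None].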
2025-05-07T20:32:25.2173232Z op = torch.compile(op) 2025-05-07T20:32:25.2173346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2173420Z 2025-05-07T20:32:25.2173510Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2173521Z 2025-05-07T20:32:25.2173619Z moe/activation_test.py:117: 2025-05-07T20:32:25.2173802Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2173911Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2174009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2174374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2174471Z return fn(*args, **kwargs) 2025-05-07T20:32:25.2174960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2175062Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2175416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2175719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2176099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2176197Z kernel = self.compile( 2025-05-07T20:32:25.2176574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2176754Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2176883Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2176887Z 2025-05-07T20:32:25.2177098Z self = 2025-05-07T20:32:25.2177869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2178378Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3aa2a20>} 2025-05-07T20:32:25.2179127Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2179317Z context = 2025-05-07T20:32:25.2179322Z 2025-05-07T20:32:25.2179490Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2179749Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2179855Z module_map=module_map) 2025-05-07T20:32:25.2180021Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2180124Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2180206Z E ^ 2025-05-07T20:32:25.2180561Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2180566Z 2025-05-07T20:32:25.2180983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2180987Z 2025-05-07T20:32:25.2181095Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2181317Z self=, 2025-05-07T20:32:25.2181401Z T=1, 2025-05-07T20:32:25.2181477Z D=5120, 2025-05-07T20:32:25.2181559Z scale_ub=1200.0, 2025-05-07T20:32:25.2181655Z contiguous=False, 2025-05-07T20:32:25.2181739Z compiled=False, 2025-05-07T20:32:25.2181811Z ) 2025-05-07T20:32:25.2182040Z self = 2025-05-07T20:32:25.2182212Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.2182217Z 2025-05-07T20:32:25.2182292Z @given( 2025-05-07T20:32:25.2182418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2182569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2182692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2182808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2182920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2183000Z ) 2025-05-07T20:32:25.2183245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2183338Z def test_silu_mul_quant( 2025-05-07T20:32:25.2183423Z self, 2025-05-07T20:32:25.2183502Z T: int, 2025-05-07T20:32:25.2183578Z D: int, 2025-05-07T20:32:25.2183682Z scale_ub: Optional[float], 2025-05-07T20:32:25.2183814Z contiguous: bool, 2025-05-07T20:32:25.2183943Z compiled: bool, 2025-05-07T20:32:25.2184028Z ) -> None: 2025-05-07T20:32:25.2184123Z torch.manual_seed(2025) 2025-05-07T20:32:25.2184203Z 2025-05-07T20:32:25.2184411Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2184493Z 2025-05-07T20:32:25.2184592Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2184719Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2184807Z x = x_sign * x_clamp 2025-05-07T20:32:25.2184893Z x0 = x[:, :D] 2025-05-07T20:32:25.2184975Z x1 = x[:, D:] 2025-05-07T20:32:25.2185047Z 2025-05-07T20:32:25.2185141Z if contiguous: 2025-05-07T20:32:25.2185233Z x0 = x0.contiguous() 2025-05-07T20:32:25.2185323Z x1 = x1.contiguous() 2025-05-07T20:32:25.2185401Z 2025-05-07T20:32:25.2185491Z if scale_ub is not None: 2025-05-07T20:32:25.2185600Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2185743Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2185821Z ) 2025-05-07T20:32:25.2185903Z else: 2025-05-07T20:32:25.2185996Z scale_ub_tensor = None 2025-05-07T20:32:25.2186070Z 2025-05-07T20:32:25.2186211Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2186301Z op = silu_mul_quant 2025-05-07T20:32:25.2186386Z if compiled: 2025-05-07T20:32:25.2186494Z op = torch.compile(op) 2025-05-07T20:32:25.2186598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2186670Z 2025-05-07T20:32:25.2186769Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2186773Z 2025-05-07T20:32:25.2186870Z moe/activation_test.py:117: 2025-05-07T20:32:25.2187007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2187108Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2187206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2187711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2187807Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2188167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2188393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2188730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2188828Z kernel = self.compile( 2025-05-07T20:32:25.2189376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2189549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2189682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2189691Z 2025-05-07T20:32:25.2189895Z self = 2025-05-07T20:32:25.2190778Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2191280Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3aa31a0>} 2025-05-07T20:32:25.2192022Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2192218Z context = 2025-05-07T20:32:25.2192223Z 2025-05-07T20:32:25.2192537Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2192841Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2192985Z module_map=module_map) 2025-05-07T20:32:25.2193151Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2193254Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2193332Z E ^ 2025-05-07T20:32:25.2193685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2193696Z 2025-05-07T20:32:25.2194106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2194110Z 2025-05-07T20:32:25.2194212Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2194440Z self=, 2025-05-07T20:32:25.2194520Z T=16384, 2025-05-07T20:32:25.2194599Z D=5120, 2025-05-07T20:32:25.2194690Z scale_ub=1200.0, 2025-05-07T20:32:25.2194778Z contiguous=False, 2025-05-07T20:32:25.2194861Z compiled=True, 2025-05-07T20:32:25.2194941Z ) 2025-05-07T20:32:25.2195162Z self = 2025-05-07T20:32:25.2195343Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.2195347Z 2025-05-07T20:32:25.2195423Z @given( 2025-05-07T20:32:25.2195540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2195645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2195760Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2195874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2195992Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2196066Z ) 2025-05-07T20:32:25.2196316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2196414Z def test_silu_mul_quant( 2025-05-07T20:32:25.2196490Z self, 2025-05-07T20:32:25.2196573Z T: int, 2025-05-07T20:32:25.2196650Z D: int, 2025-05-07T20:32:25.2196753Z scale_ub: Optional[float], 2025-05-07T20:32:25.2196851Z contiguous: bool, 2025-05-07T20:32:25.2196937Z compiled: bool, 2025-05-07T20:32:25.2197014Z ) -> None: 2025-05-07T20:32:25.2197115Z torch.manual_seed(2025) 2025-05-07T20:32:25.2197192Z 2025-05-07T20:32:25.2197359Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2197440Z 2025-05-07T20:32:25.2197532Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2197657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2197752Z x = x_sign * x_clamp 2025-05-07T20:32:25.2197833Z x0 = x[:, :D] 2025-05-07T20:32:25.2197922Z x1 = x[:, D:] 2025-05-07T20:32:25.2197997Z 2025-05-07T20:32:25.2198082Z if contiguous: 2025-05-07T20:32:25.2198182Z x0 = x0.contiguous() 2025-05-07T20:32:25.2198271Z x1 = x1.contiguous() 2025-05-07T20:32:25.2198343Z 2025-05-07T20:32:25.2198439Z if scale_ub is not None: 2025-05-07T20:32:25.2198595Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2198731Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2198816Z ) 2025-05-07T20:32:25.2198891Z else: 2025-05-07T20:32:25.2198985Z scale_ub_tensor = None 2025-05-07T20:32:25.2199066Z 2025-05-07T20:32:25.2199194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2199290Z op = silu_mul_quant 2025-05-07T20:32:25.2199375Z if compiled: 2025-05-07T20:32:25.2199474Z op = torch.compile(op) 2025-05-07T20:32:25.2199589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2199662Z 2025-05-07T20:32:25.2199806Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2199895Z 2025-05-07T20:32:25.2200013Z moe/activation_test.py:117: 2025-05-07T20:32:25.2200166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2200304Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2200414Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2200779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2200880Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2201367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2201463Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2201822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2202041Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2202380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2202479Z kernel = self.compile( 2025-05-07T20:32:25.2202860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2203036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2203163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2203168Z 2025-05-07T20:32:25.2203370Z self = 2025-05-07T20:32:25.2204143Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2204642Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3308ea0>} 2025-05-07T20:32:25.2205395Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2205586Z context = 2025-05-07T20:32:25.2205590Z 2025-05-07T20:32:25.2205758Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2206018Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2206126Z module_map=module_map) 2025-05-07T20:32:25.2206291Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2206390Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2206469Z E ^ 2025-05-07T20:32:25.2206831Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2206835Z 2025-05-07T20:32:25.2207294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2207299Z 2025-05-07T20:32:25.2207407Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2207628Z self=, 2025-05-07T20:32:25.2207704Z T=2048, 2025-05-07T20:32:25.2207787Z D=7168, 2025-05-07T20:32:25.2207870Z scale_ub=1200.0, 2025-05-07T20:32:25.2207956Z contiguous=False, 2025-05-07T20:32:25.2208045Z compiled=True, 2025-05-07T20:32:25.2208118Z ) 2025-05-07T20:32:25.2208333Z self = 2025-05-07T20:32:25.2208512Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.2208559Z 2025-05-07T20:32:25.2208674Z @given( 2025-05-07T20:32:25.2208797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2208896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2209048Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2209172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2209285Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2209358Z ) 2025-05-07T20:32:25.2209606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2209698Z def test_silu_mul_quant( 2025-05-07T20:32:25.2209779Z self, 2025-05-07T20:32:25.2209854Z T: int, 2025-05-07T20:32:25.2209930Z D: int, 2025-05-07T20:32:25.2210034Z scale_ub: Optional[float], 2025-05-07T20:32:25.2210121Z contiguous: bool, 2025-05-07T20:32:25.2210206Z compiled: bool, 2025-05-07T20:32:25.2210287Z ) -> None: 2025-05-07T20:32:25.2210385Z torch.manual_seed(2025) 2025-05-07T20:32:25.2210458Z 2025-05-07T20:32:25.2210631Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2210703Z 2025-05-07T20:32:25.2210792Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2210928Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2211017Z x = x_sign * x_clamp 2025-05-07T20:32:25.2211097Z x0 = x[:, :D] 2025-05-07T20:32:25.2211182Z x1 = x[:, D:] 2025-05-07T20:32:25.2211254Z 2025-05-07T20:32:25.2215724Z if contiguous: 2025-05-07T20:32:25.2215837Z x0 = x0.contiguous() 2025-05-07T20:32:25.2215941Z x1 = x1.contiguous() 2025-05-07T20:32:25.2216016Z 2025-05-07T20:32:25.2216108Z if scale_ub is not None: 2025-05-07T20:32:25.2216227Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2216370Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2216460Z ) 2025-05-07T20:32:25.2216543Z else: 2025-05-07T20:32:25.2216638Z scale_ub_tensor = None 2025-05-07T20:32:25.2216723Z 2025-05-07T20:32:25.2216859Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2216955Z op = silu_mul_quant 2025-05-07T20:32:25.2217052Z if compiled: 2025-05-07T20:32:25.2217156Z op = torch.compile(op) 2025-05-07T20:32:25.2217263Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2217347Z 2025-05-07T20:32:25.2217440Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2217445Z 2025-05-07T20:32:25.2217544Z moe/activation_test.py:117: 2025-05-07T20:32:25.2217686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2217790Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2217899Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2218273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2218373Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2218875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:25.2219053Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.2219416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.2219650Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.2220041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.2220146Z     kernel = self.compile(
2025-05-07T20:32:25.2220528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.2220704Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.2220896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.2220942Z 
2025-05-07T20:32:25.2221153Z self = <...>
2025-05-07T20:32:25.2221983Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.2222488Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7efca33099e0>}
2025-05-07T20:32:25.2223232Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:25.2223435Z context = <...>
2025-05-07T20:32:25.2223444Z 
2025-05-07T20:32:25.2223610Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.2223882Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.2223995Z                            module_map=module_map)
2025-05-07T20:32:25.2224157Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.2224264Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.2224342Z E       ^
2025-05-07T20:32:25.2224704Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.2224710Z 
2025-05-07T20:32:25.2225123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.2225128Z 
2025-05-07T20:32:25.2225230Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.2225462Z     self=<...>,
2025-05-07T20:32:25.2225545Z     T=1,
2025-05-07T20:32:25.2225623Z     D=5120,
2025-05-07T20:32:25.2225716Z     scale_ub=None,
2025-05-07T20:32:25.2225805Z     contiguous=False,
2025-05-07T20:32:25.2225899Z     compiled=False,
2025-05-07T20:32:25.2225975Z )
2025-05-07T20:32:25.2226195Z self = <...>
2025-05-07T20:32:25.2226369Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:25.2226374Z 
2025-05-07T20:32:25.2226452Z     @given(
2025-05-07T20:32:25.2226575Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:25.2226687Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:25.2226802Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:25.2226920Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:25.2227039Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:25.2227116Z     )
2025-05-07T20:32:25.2227370Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:25.2227468Z     def test_silu_mul_quant(
2025-05-07T20:32:25.2227544Z         self,
2025-05-07T20:32:25.2227632Z         T: int,
2025-05-07T20:32:25.2227712Z         D: int,
2025-05-07T20:32:25.2227861Z         scale_ub: Optional[float],
2025-05-07T20:32:25.2227962Z         contiguous: bool,
2025-05-07T20:32:25.2228048Z         compiled: bool,
2025-05-07T20:32:25.2228131Z     ) -> None:
2025-05-07T20:32:25.2228596Z         torch.manual_seed(2025)
2025-05-07T20:32:25.2228704Z 
2025-05-07T20:32:25.2228916Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:25.2229000Z 
2025-05-07T20:32:25.2229148Z         x_sign = torch.sign(x)
2025-05-07T20:32:25.2229282Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:25.2229373Z         x = x_sign * x_clamp
2025-05-07T20:32:25.2229457Z         x0 = x[:, :D]
2025-05-07T20:32:25.2229715Z         x1 = x[:, D:]
2025-05-07T20:32:25.2229863Z 
2025-05-07T20:32:25.2229969Z         if contiguous:
2025-05-07T20:32:25.2230074Z             x0 = x0.contiguous()
2025-05-07T20:32:25.2230178Z             x1 = x1.contiguous()
2025-05-07T20:32:25.2230313Z 
2025-05-07T20:32:25.2230417Z         if scale_ub is not None:
2025-05-07T20:32:25.2230525Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:25.2230660Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:25.2230742Z             )
2025-05-07T20:32:25.2230818Z         else:
2025-05-07T20:32:25.2230917Z             scale_ub_tensor = None
2025-05-07T20:32:25.2230994Z 
2025-05-07T20:32:25.2231125Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:25.2231222Z             op = silu_mul_quant
2025-05-07T20:32:25.2231310Z             if compiled:
2025-05-07T20:32:25.2231412Z                 op = torch.compile(op)
2025-05-07T20:32:25.2231528Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.2231608Z 
2025-05-07T20:32:25.2231701Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.2231706Z 
2025-05-07T20:32:25.2231811Z moe/activation_test.py:117: 
2025-05-07T20:32:25.2231942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:25.2232047Z moe/activation_test.py:115: in fn
2025-05-07T20:32:25.2232155Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.2232652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:25.2232759Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.2233117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.2233339Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.2233687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.2233787Z     kernel = self.compile(
2025-05-07T20:32:25.2234175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.2234350Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.2234478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:25.2234483Z 
2025-05-07T20:32:25.2234697Z self = <...>
2025-05-07T20:32:25.2235469Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.2235978Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7efca330ad40>}
2025-05-07T20:32:25.2236727Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:25.2237007Z context = <...>
2025-05-07T20:32:25.2237013Z 
2025-05-07T20:32:25.2237213Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.2237561Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.2237683Z                            module_map=module_map)
2025-05-07T20:32:25.2237845Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.2237944Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.2238031Z E       ^
2025-05-07T20:32:25.2238386Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.2238521Z 
2025-05-07T20:32:25.2238943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.2238947Z 
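Note on the failure mode: every drawn example in this run dies at the same point, with Triton refusing to lower the fp8e4nv (FP8 E4M3) type while compiling _fbgemm_silu_mul_quant. The supported list in the error, ('fp8e4b15', 'fp8e5'), is what Triton offers on this runner's GPU; the g5 runner carries an A10G (compute capability 8.6), and fp8e4nv generally requires compute capability 8.9 or newer (Ada/Hopper). A minimal skip-guard sketch follows; the helper and test-class names are hypothetical, not from the FBGEMM sources:

    # Hypothetical guard: skip FP8 E4M3 tests on GPUs that Triton cannot
    # compile fp8e4nv for (assumption: support starts at compute capability 8.9).
    import unittest

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # get_device_capability() returns a (major, minor) tuple for the current device.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not cuda_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+ (Ada/Hopper)")
    class SiluMulQuantTests(unittest.TestCase):
        ...

With such a guard the run below would report these cases as skipped on the A10G instead of failing the job.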
The next ten drawn examples fail identically while Triton compiles _fbgemm_silu_mul_quant; one line per example:
2025-05-07T20:32:25.2239090Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2251877Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2265224Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2278492Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2291345Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2304574Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2317310Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2331202Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2348886Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2361887Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
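The @given grid above samples 5 x 2 x 2 x 2 x 2 = 80 parameter combinations (capped by max_examples=_MAX_SAMPLES), but the parameters never matter here: compilation fails before the kernel runs, so one direct call reproduces the error. A minimal repro sketch; the import path is inferred from the traceback and assumes a CUDA device plus the fbgemm_gpu genai build:

    # On a GPU without fp8e4nv support this raises
    # triton.compiler.errors.CompilationError, matching the log above.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe import silu_mul_quant  # path inferred

    T, D = 1, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)  # scale_ub=None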
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2374769Z 2025-05-07T20:32:25.2375190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2375195Z 2025-05-07T20:32:25.2375297Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2375520Z self=, 2025-05-07T20:32:25.2375606Z T=4096, 2025-05-07T20:32:25.2375684Z D=7168, 2025-05-07T20:32:25.2375768Z scale_ub=None, 2025-05-07T20:32:25.2375864Z contiguous=False, 2025-05-07T20:32:25.2375949Z compiled=True, 2025-05-07T20:32:25.2376025Z ) 2025-05-07T20:32:25.2376253Z self = 2025-05-07T20:32:25.2376424Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:25.2376428Z 2025-05-07T20:32:25.2376518Z @given( 2025-05-07T20:32:25.2376640Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2376742Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2376866Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2376983Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2377096Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2377180Z ) 2025-05-07T20:32:25.2377423Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2377527Z def test_silu_mul_quant( 2025-05-07T20:32:25.2377605Z self, 2025-05-07T20:32:25.2377684Z T: int, 2025-05-07T20:32:25.2377770Z D: int, 2025-05-07T20:32:25.2377871Z scale_ub: Optional[float], 2025-05-07T20:32:25.2377964Z contiguous: bool, 2025-05-07T20:32:25.2378059Z compiled: bool, 2025-05-07T20:32:25.2378139Z ) -> None: 2025-05-07T20:32:25.2378235Z torch.manual_seed(2025) 2025-05-07T20:32:25.2378321Z 2025-05-07T20:32:25.2378492Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2378571Z 2025-05-07T20:32:25.2378673Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2378799Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2378890Z x = x_sign * x_clamp 2025-05-07T20:32:25.2378982Z x0 = x[:, :D] 2025-05-07T20:32:25.2379066Z x1 = x[:, D:] 2025-05-07T20:32:25.2379148Z 2025-05-07T20:32:25.2379233Z if contiguous: 2025-05-07T20:32:25.2379326Z x0 = x0.contiguous() 2025-05-07T20:32:25.2379425Z x1 = x1.contiguous() 2025-05-07T20:32:25.2379497Z 2025-05-07T20:32:25.2379591Z if scale_ub is not None: 2025-05-07T20:32:25.2379710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2379844Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2379920Z ) 2025-05-07T20:32:25.2380007Z else: 2025-05-07T20:32:25.2380153Z scale_ub_tensor = None 2025-05-07T20:32:25.2380227Z 2025-05-07T20:32:25.2380365Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2380458Z op = silu_mul_quant 2025-05-07T20:32:25.2380550Z if compiled: 2025-05-07T20:32:25.2380667Z op = torch.compile(op) 2025-05-07T20:32:25.2380784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2380879Z 2025-05-07T20:32:25.2380972Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2380977Z 2025-05-07T20:32:25.2381080Z moe/activation_test.py:117: 2025-05-07T20:32:25.2381209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2381351Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2381493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2381858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2381989Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2382491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2382588Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2382948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2383167Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2383502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2383604Z kernel = self.compile( 2025-05-07T20:32:25.2383987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2384162Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2384301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2384306Z 2025-05-07T20:32:25.2384510Z self = 2025-05-07T20:32:25.2385289Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2385788Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca363a700>} 2025-05-07T20:32:25.2386532Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2386727Z context = 2025-05-07T20:32:25.2386734Z 2025-05-07T20:32:25.2386897Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2387165Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2387271Z module_map=module_map) 2025-05-07T20:32:25.2387437Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2387534Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2387613Z E ^ 2025-05-07T20:32:25.2387975Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2387980Z 2025-05-07T20:32:25.2388392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2388401Z 2025-05-07T20:32:25.2388502Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2388729Z self=, 2025-05-07T20:32:25.2388852Z T=16384, 2025-05-07T20:32:25.2388936Z D=5120, 2025-05-07T20:32:25.2389019Z scale_ub=1200.0, 2025-05-07T20:32:25.2389223Z contiguous=False, 2025-05-07T20:32:25.2389313Z compiled=False, 2025-05-07T20:32:25.2389384Z ) 2025-05-07T20:32:25.2389602Z self = 2025-05-07T20:32:25.2389785Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.2389789Z 2025-05-07T20:32:25.2389865Z @given( 2025-05-07T20:32:25.2389982Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2390090Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2390249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2390410Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2390525Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2390598Z ) 2025-05-07T20:32:25.2390887Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2390981Z def test_silu_mul_quant( 2025-05-07T20:32:25.2391058Z self, 2025-05-07T20:32:25.2391140Z T: int, 2025-05-07T20:32:25.2391217Z D: int, 2025-05-07T20:32:25.2391314Z scale_ub: Optional[float], 2025-05-07T20:32:25.2391409Z contiguous: bool, 2025-05-07T20:32:25.2391495Z compiled: bool, 2025-05-07T20:32:25.2391573Z ) -> None: 2025-05-07T20:32:25.2391672Z torch.manual_seed(2025) 2025-05-07T20:32:25.2391744Z 2025-05-07T20:32:25.2391917Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2391991Z 2025-05-07T20:32:25.2392085Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2392216Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2392305Z x = x_sign * x_clamp 2025-05-07T20:32:25.2392385Z x0 = x[:, :D] 2025-05-07T20:32:25.2392471Z x1 = x[:, D:] 2025-05-07T20:32:25.2392546Z 2025-05-07T20:32:25.2392631Z if contiguous: 2025-05-07T20:32:25.2392728Z x0 = x0.contiguous() 2025-05-07T20:32:25.2392816Z x1 = x1.contiguous() 2025-05-07T20:32:25.2392888Z 2025-05-07T20:32:25.2392983Z if scale_ub is not None: 2025-05-07T20:32:25.2393088Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2393225Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2393301Z ) 2025-05-07T20:32:25.2393379Z else: 2025-05-07T20:32:25.2393476Z scale_ub_tensor = None 2025-05-07T20:32:25.2393548Z 2025-05-07T20:32:25.2393676Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2393774Z op = silu_mul_quant 2025-05-07T20:32:25.2393864Z if compiled: 2025-05-07T20:32:25.2393963Z op = torch.compile(op) 2025-05-07T20:32:25.2394075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2394154Z 2025-05-07T20:32:25.2394249Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2394260Z 2025-05-07T20:32:25.2394357Z moe/activation_test.py:117: 2025-05-07T20:32:25.2394487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2394593Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2394694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2395186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:25.2395290Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2395648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2395873Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2396221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2396396Z kernel = self.compile( 2025-05-07T20:32:25.2396782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2396953Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2397082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2397086Z 2025-05-07T20:32:25.2397296Z self = 2025-05-07T20:32:25.2398061Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2398646Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca363b060>} 2025-05-07T20:32:25.2399429Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2399652Z context = 2025-05-07T20:32:25.2399656Z 2025-05-07T20:32:25.2399845Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2400104Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2400217Z module_map=module_map) 2025-05-07T20:32:25.2400377Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2400481Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2400563Z E ^ 2025-05-07T20:32:25.2400918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2400925Z 2025-05-07T20:32:25.2401348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2401352Z 2025-05-07T20:32:25.2401456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2401679Z self=, 2025-05-07T20:32:25.2401764Z T=16384, 2025-05-07T20:32:25.2401840Z D=5120, 2025-05-07T20:32:25.2401923Z scale_ub=1200.0, 2025-05-07T20:32:25.2402016Z contiguous=True, 2025-05-07T20:32:25.2402100Z compiled=True, 2025-05-07T20:32:25.2402177Z ) 2025-05-07T20:32:25.2402392Z self = 2025-05-07T20:32:25.2402565Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.2402573Z 2025-05-07T20:32:25.2402656Z @given( 2025-05-07T20:32:25.2402773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2402874Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2402997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2403112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2403223Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2403304Z ) 2025-05-07T20:32:25.2403546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2403644Z def test_silu_mul_quant( 2025-05-07T20:32:25.2403733Z self, 2025-05-07T20:32:25.2403817Z T: int, 2025-05-07T20:32:25.2403895Z D: int, 2025-05-07T20:32:25.2403993Z scale_ub: Optional[float], 2025-05-07T20:32:25.2404091Z contiguous: bool, 2025-05-07T20:32:25.2404181Z compiled: bool, 2025-05-07T20:32:25.2404263Z ) -> None: 2025-05-07T20:32:25.2404364Z torch.manual_seed(2025) 2025-05-07T20:32:25.2404437Z 2025-05-07T20:32:25.2404606Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2404686Z 2025-05-07T20:32:25.2404829Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2404954Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2405051Z x = x_sign * x_clamp 2025-05-07T20:32:25.2405132Z x0 = x[:, :D] 2025-05-07T20:32:25.2405218Z x1 = x[:, D:] 2025-05-07T20:32:25.2405289Z 2025-05-07T20:32:25.2405373Z if contiguous: 2025-05-07T20:32:25.2405472Z x0 = x0.contiguous() 2025-05-07T20:32:25.2405560Z x1 = x1.contiguous() 2025-05-07T20:32:25.2405633Z 2025-05-07T20:32:25.2405729Z if scale_ub is not None: 2025-05-07T20:32:25.2405834Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2406012Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2406134Z ) 2025-05-07T20:32:25.2406209Z else: 2025-05-07T20:32:25.2406303Z scale_ub_tensor = None 2025-05-07T20:32:25.2406385Z 2025-05-07T20:32:25.2406554Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2406649Z op = silu_mul_quant 2025-05-07T20:32:25.2406740Z if compiled: 2025-05-07T20:32:25.2406840Z op = torch.compile(op) 2025-05-07T20:32:25.2406950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2407022Z 2025-05-07T20:32:25.2407111Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2407115Z 2025-05-07T20:32:25.2407218Z moe/activation_test.py:117: 2025-05-07T20:32:25.2407346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2407445Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2407549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2407914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2408014Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2408510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2408606Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2408964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2409181Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2409516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2409622Z kernel = self.compile( 2025-05-07T20:32:25.2410048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2410227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2410357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2410362Z 2025-05-07T20:32:25.2410567Z self = 2025-05-07T20:32:25.2411345Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2411841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca35d11c0>} 2025-05-07T20:32:25.2412586Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2412779Z context = 2025-05-07T20:32:25.2412783Z 2025-05-07T20:32:25.2412944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2413255Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2413363Z module_map=module_map) 2025-05-07T20:32:25.2413528Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2413626Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2413702Z E ^ 2025-05-07T20:32:25.2414065Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2414070Z 2025-05-07T20:32:25.2414483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2414529Z 2025-05-07T20:32:25.2414639Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2414898Z self=, 2025-05-07T20:32:25.2414976Z T=16384, 2025-05-07T20:32:25.2415060Z D=5120, 2025-05-07T20:32:25.2415184Z scale_ub=None, 2025-05-07T20:32:25.2415276Z contiguous=False, 2025-05-07T20:32:25.2415366Z compiled=True, 2025-05-07T20:32:25.2415438Z ) 2025-05-07T20:32:25.2415653Z self = 2025-05-07T20:32:25.2415832Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:25.2415836Z 2025-05-07T20:32:25.2415914Z @given( 2025-05-07T20:32:25.2416035Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2416134Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2416247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2416369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2416488Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2416564Z ) 2025-05-07T20:32:25.2416813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2416911Z def test_silu_mul_quant( 2025-05-07T20:32:25.2416989Z self, 2025-05-07T20:32:25.2417073Z T: int, 2025-05-07T20:32:25.2417150Z D: int, 2025-05-07T20:32:25.2417253Z scale_ub: Optional[float], 2025-05-07T20:32:25.2417341Z contiguous: bool, 2025-05-07T20:32:25.2417426Z compiled: bool, 2025-05-07T20:32:25.2417509Z ) -> None: 2025-05-07T20:32:25.2417602Z torch.manual_seed(2025) 2025-05-07T20:32:25.2417675Z 2025-05-07T20:32:25.2417845Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2417919Z 2025-05-07T20:32:25.2418012Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2418141Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2418233Z x = x_sign * x_clamp 2025-05-07T20:32:25.2418319Z x0 = x[:, :D] 2025-05-07T20:32:25.2418408Z x1 = x[:, D:] 2025-05-07T20:32:25.2418480Z 2025-05-07T20:32:25.2418565Z if contiguous: 2025-05-07T20:32:25.2418668Z x0 = x0.contiguous() 2025-05-07T20:32:25.2418759Z x1 = x1.contiguous() 2025-05-07T20:32:25.2418839Z 2025-05-07T20:32:25.2418929Z if scale_ub is not None: 2025-05-07T20:32:25.2419035Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2419176Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2419252Z ) 2025-05-07T20:32:25.2419329Z else: 2025-05-07T20:32:25.2419429Z scale_ub_tensor = None 2025-05-07T20:32:25.2419502Z 2025-05-07T20:32:25.2419629Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2419727Z op = silu_mul_quant 2025-05-07T20:32:25.2419815Z if compiled: 2025-05-07T20:32:25.2419915Z op = torch.compile(op) 2025-05-07T20:32:25.2420029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2420102Z 2025-05-07T20:32:25.2420199Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2420203Z 2025-05-07T20:32:25.2420302Z moe/activation_test.py:117: 2025-05-07T20:32:25.2420487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2420597Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2420695Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2421058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2421156Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2421642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2421743Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2422094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2422394Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2422813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2422909Z kernel = self.compile( 2025-05-07T20:32:25.2423286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2423464Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2423591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2423596Z 2025-05-07T20:32:25.2423804Z self = 2025-05-07T20:32:25.2424572Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2425089Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca35d1d00>} 2025-05-07T20:32:25.2425828Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2426017Z context = 2025-05-07T20:32:25.2426021Z 2025-05-07T20:32:25.2426189Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2426445Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2426557Z module_map=module_map) 2025-05-07T20:32:25.2426719Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2426820Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2426902Z E ^ 2025-05-07T20:32:25.2427259Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2427263Z 2025-05-07T20:32:25.2427680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2427690Z 2025-05-07T20:32:25.2427791Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2428012Z self=, 2025-05-07T20:32:25.2428095Z T=2048, 2025-05-07T20:32:25.2428908Z D=5120, 2025-05-07T20:32:25.2429002Z scale_ub=None, 2025-05-07T20:32:25.2429140Z contiguous=False, 2025-05-07T20:32:25.2429226Z compiled=True, 2025-05-07T20:32:25.2429299Z ) 2025-05-07T20:32:25.2429521Z self = 2025-05-07T20:32:25.2429728Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:25.2429733Z 2025-05-07T20:32:25.2429822Z @given( 2025-05-07T20:32:25.2429964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2430251Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2430377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2430494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2430607Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2430687Z ) 2025-05-07T20:32:25.2430932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2431026Z def test_silu_mul_quant( 2025-05-07T20:32:25.2431110Z self, 2025-05-07T20:32:25.2431186Z T: int, 2025-05-07T20:32:25.2431265Z D: int, 2025-05-07T20:32:25.2431367Z scale_ub: Optional[float], 2025-05-07T20:32:25.2431601Z contiguous: bool, 2025-05-07T20:32:25.2431693Z compiled: bool, 2025-05-07T20:32:25.2431774Z ) -> None: 2025-05-07T20:32:25.2431871Z torch.manual_seed(2025) 2025-05-07T20:32:25.2431950Z 2025-05-07T20:32:25.2432180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2432253Z 2025-05-07T20:32:25.2432351Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2432473Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2432562Z x = x_sign * x_clamp 2025-05-07T20:32:25.2432647Z x0 = x[:, :D] 2025-05-07T20:32:25.2432728Z x1 = x[:, D:] 2025-05-07T20:32:25.2432800Z 2025-05-07T20:32:25.2432891Z if contiguous: 2025-05-07T20:32:25.2432983Z x0 = x0.contiguous() 2025-05-07T20:32:25.2433071Z x1 = x1.contiguous() 2025-05-07T20:32:25.2433148Z 2025-05-07T20:32:25.2433237Z if scale_ub is not None: 2025-05-07T20:32:25.2433354Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2433491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2433566Z ) 2025-05-07T20:32:25.2433648Z else: 2025-05-07T20:32:25.2433744Z scale_ub_tensor = None 2025-05-07T20:32:25.2433819Z 2025-05-07T20:32:25.2433956Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2434049Z op = silu_mul_quant 2025-05-07T20:32:25.2434136Z if compiled: 2025-05-07T20:32:25.2434244Z op = torch.compile(op) 2025-05-07T20:32:25.2434348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2434420Z 2025-05-07T20:32:25.2434517Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2434522Z 2025-05-07T20:32:25.2434619Z moe/activation_test.py:117: 2025-05-07T20:32:25.2434759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2434860Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2434961Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2435339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2435432Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2435925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2436031Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2436385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2436611Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2436948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2437042Z kernel = self.compile( 2025-05-07T20:32:25.2437426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2437606Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2437742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2437749Z 2025-05-07T20:32:25.2438001Z self = 2025-05-07T20:32:25.2438772Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2439279Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca35d1620>} 2025-05-07T20:32:25.2440048Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2440347Z context = 2025-05-07T20:32:25.2440352Z 2025-05-07T20:32:25.2440553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2440813Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2440927Z module_map=module_map) 2025-05-07T20:32:25.2441086Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2441191Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2441268Z E ^ 2025-05-07T20:32:25.2441618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2441622Z 2025-05-07T20:32:25.2442038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2442047Z 2025-05-07T20:32:25.2442150Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2442375Z self=, 2025-05-07T20:32:25.2442454Z T=2048, 2025-05-07T20:32:25.2442531Z D=5120, 2025-05-07T20:32:25.2442623Z scale_ub=1200.0, 2025-05-07T20:32:25.2442712Z contiguous=False, 2025-05-07T20:32:25.2442795Z compiled=True, 2025-05-07T20:32:25.2442876Z ) 2025-05-07T20:32:25.2443093Z self = 2025-05-07T20:32:25.2443263Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.2443267Z 2025-05-07T20:32:25.2443351Z @given( 2025-05-07T20:32:25.2443469Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2443574Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2443690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2443808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2443934Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2444008Z ) 2025-05-07T20:32:25.2444253Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2444356Z def test_silu_mul_quant( 2025-05-07T20:32:25.2444432Z self, 2025-05-07T20:32:25.2444512Z T: int, 2025-05-07T20:32:25.2444595Z D: int, 2025-05-07T20:32:25.2444692Z scale_ub: Optional[float], 2025-05-07T20:32:25.2444785Z contiguous: bool, 2025-05-07T20:32:25.2444876Z compiled: bool, 2025-05-07T20:32:25.2444956Z ) -> None: 2025-05-07T20:32:25.2445057Z torch.manual_seed(2025) 2025-05-07T20:32:25.2445131Z 2025-05-07T20:32:25.2445299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2445379Z 2025-05-07T20:32:25.2445474Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2445600Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2445702Z x = x_sign * x_clamp 2025-05-07T20:32:25.2445783Z x0 = x[:, :D] 2025-05-07T20:32:25.2445862Z x1 = x[:, D:] 2025-05-07T20:32:25.2445942Z 2025-05-07T20:32:25.2446030Z if contiguous: 2025-05-07T20:32:25.2446171Z x0 = x0.contiguous() 2025-05-07T20:32:25.2446270Z x1 = x1.contiguous() 2025-05-07T20:32:25.2446343Z 2025-05-07T20:32:25.2446432Z if scale_ub is not None: 2025-05-07T20:32:25.2446545Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2446680Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2446763Z ) 2025-05-07T20:32:25.2446840Z else: 2025-05-07T20:32:25.2446936Z scale_ub_tensor = None 2025-05-07T20:32:25.2447018Z 2025-05-07T20:32:25.2447148Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2447240Z op = silu_mul_quant 2025-05-07T20:32:25.2447378Z if compiled: 2025-05-07T20:32:25.2447516Z op = torch.compile(op) 2025-05-07T20:32:25.2447622Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2447702Z 2025-05-07T20:32:25.2447830Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2447835Z 2025-05-07T20:32:25.2447944Z moe/activation_test.py:117: 2025-05-07T20:32:25.2448078Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2448179Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2448284Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2448650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2448743Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2449239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2449340Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2449702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2449921Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2450260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2450365Z kernel = self.compile( 2025-05-07T20:32:25.2450742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2450917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2451051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2451055Z 2025-05-07T20:32:25.2451259Z self = 2025-05-07T20:32:25.2452032Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2452546Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca34905e0>} 2025-05-07T20:32:25.2453293Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2453483Z context = 2025-05-07T20:32:25.2453487Z 2025-05-07T20:32:25.2453650Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2453919Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2454028Z module_map=module_map) 2025-05-07T20:32:25.2454192Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2454298Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2454376Z E ^ 2025-05-07T20:32:25.2454781Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2454787Z 2025-05-07T20:32:25.2455199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2455203Z 2025-05-07T20:32:25.2455306Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2455532Z self=, 2025-05-07T20:32:25.2455614Z T=4096, 2025-05-07T20:32:25.2455700Z D=5120, 2025-05-07T20:32:25.2455784Z scale_ub=1200.0, 2025-05-07T20:32:25.2455871Z contiguous=True, 2025-05-07T20:32:25.2455959Z compiled=True, 2025-05-07T20:32:25.2456073Z ) 2025-05-07T20:32:25.2456354Z self = 2025-05-07T20:32:25.2456530Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.2456535Z 2025-05-07T20:32:25.2456749Z @given( 2025-05-07T20:32:25.2456872Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2456977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2457091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2457212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2457324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2457399Z ) 2025-05-07T20:32:25.2457647Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2457741Z def test_silu_mul_quant( 2025-05-07T20:32:25.2457818Z self, 2025-05-07T20:32:25.2457901Z T: int, 2025-05-07T20:32:25.2457983Z D: int, 2025-05-07T20:32:25.2458084Z scale_ub: Optional[float], 2025-05-07T20:32:25.2458185Z contiguous: bool, 2025-05-07T20:32:25.2458271Z compiled: bool, 2025-05-07T20:32:25.2458353Z ) -> None: 2025-05-07T20:32:25.2458461Z torch.manual_seed(2025) 2025-05-07T20:32:25.2458537Z 2025-05-07T20:32:25.2458713Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2458787Z 2025-05-07T20:32:25.2458880Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2459010Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2459100Z x = x_sign * x_clamp 2025-05-07T20:32:25.2459180Z x0 = x[:, :D] 2025-05-07T20:32:25.2459266Z x1 = x[:, D:] 2025-05-07T20:32:25.2459339Z 2025-05-07T20:32:25.2459423Z if contiguous: 2025-05-07T20:32:25.2459524Z x0 = x0.contiguous() 2025-05-07T20:32:25.2459638Z x1 = x1.contiguous() 2025-05-07T20:32:25.2459714Z 2025-05-07T20:32:25.2459838Z if scale_ub is not None: 2025-05-07T20:32:25.2459951Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2460086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2460168Z ) 2025-05-07T20:32:25.2460247Z else: 2025-05-07T20:32:25.2460347Z scale_ub_tensor = None 2025-05-07T20:32:25.2460420Z 2025-05-07T20:32:25.2460548Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2460643Z op = silu_mul_quant 2025-05-07T20:32:25.2460729Z if compiled: 2025-05-07T20:32:25.2460831Z op = torch.compile(op) 2025-05-07T20:32:25.2460944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2461017Z 2025-05-07T20:32:25.2461109Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2461113Z 2025-05-07T20:32:25.2461217Z moe/activation_test.py:117: 2025-05-07T20:32:25.2461344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2461457Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2461557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2461923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2462075Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2462563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2462661Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2463019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2463237Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2467642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2467758Z kernel = self.compile( 2025-05-07T20:32:25.2468238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2468455Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2468639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2468644Z 2025-05-07T20:32:25.2468854Z self = 2025-05-07T20:32:25.2469732Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2470247Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3491120>} 2025-05-07T20:32:25.2471045Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2471249Z context = 2025-05-07T20:32:25.2471256Z 2025-05-07T20:32:25.2471423Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2471685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2471802Z module_map=module_map) 2025-05-07T20:32:25.2471964Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2472075Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2472155Z E ^ 2025-05-07T20:32:25.2472509Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2472514Z 2025-05-07T20:32:25.2472936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2472945Z 2025-05-07T20:32:25.2473052Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2473286Z self=, 2025-05-07T20:32:25.2473369Z T=128, 2025-05-07T20:32:25.2473449Z D=5120, 2025-05-07T20:32:25.2473544Z scale_ub=1200.0, 2025-05-07T20:32:25.2473634Z contiguous=False, 2025-05-07T20:32:25.2473722Z compiled=True, 2025-05-07T20:32:25.2473808Z ) 2025-05-07T20:32:25.2474025Z self = 2025-05-07T20:32:25.2474200Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.2474204Z 2025-05-07T20:32:25.2474292Z @given( 2025-05-07T20:32:25.2474410Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2474523Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2474648Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2474770Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2474892Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2474970Z ) 2025-05-07T20:32:25.2475266Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2475372Z def test_silu_mul_quant( 2025-05-07T20:32:25.2475452Z self, 2025-05-07T20:32:25.2475532Z T: int, 2025-05-07T20:32:25.2475619Z D: int, 2025-05-07T20:32:25.2475719Z scale_ub: Optional[float], 2025-05-07T20:32:25.2475810Z contiguous: bool, 2025-05-07T20:32:25.2475913Z compiled: bool, 2025-05-07T20:32:25.2475993Z ) -> None: 2025-05-07T20:32:25.2476100Z torch.manual_seed(2025) 2025-05-07T20:32:25.2476176Z 2025-05-07T20:32:25.2476346Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2476429Z 2025-05-07T20:32:25.2476563Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2476724Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2476823Z x = x_sign * x_clamp 2025-05-07T20:32:25.2476905Z x0 = x[:, :D] 2025-05-07T20:32:25.2477023Z x1 = x[:, D:] 2025-05-07T20:32:25.2477105Z 2025-05-07T20:32:25.2477193Z if contiguous: 2025-05-07T20:32:25.2477285Z x0 = x0.contiguous() 2025-05-07T20:32:25.2477382Z x1 = x1.contiguous() 2025-05-07T20:32:25.2477456Z 2025-05-07T20:32:25.2477556Z if scale_ub is not None: 2025-05-07T20:32:25.2477661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2477797Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2477881Z ) 2025-05-07T20:32:25.2477958Z else: 2025-05-07T20:32:25.2478052Z scale_ub_tensor = None 2025-05-07T20:32:25.2478132Z 2025-05-07T20:32:25.2478261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2478355Z op = silu_mul_quant 2025-05-07T20:32:25.2478453Z if compiled: 2025-05-07T20:32:25.2478555Z op = torch.compile(op) 2025-05-07T20:32:25.2478661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2478746Z 2025-05-07T20:32:25.2478842Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2478846Z 2025-05-07T20:32:25.2478952Z moe/activation_test.py:117: 2025-05-07T20:32:25.2479084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2479184Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2479292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2479661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2479754Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2480254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2480358Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2480722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2480947Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2481285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2481387Z kernel = self.compile( 2025-05-07T20:32:25.2481767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2481940Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2482080Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2482084Z 2025-05-07T20:32:25.2482289Z self = 2025-05-07T20:32:25.2483067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2483619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3492340>} 2025-05-07T20:32:25.2484370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2484560Z context = 2025-05-07T20:32:25.2484565Z 2025-05-07T20:32:25.2484729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2484999Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2485181Z module_map=module_map) 2025-05-07T20:32:25.2485351Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2485490Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2485572Z E ^ 2025-05-07T20:32:25.2485937Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2485942Z 2025-05-07T20:32:25.2486354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2486358Z 2025-05-07T20:32:25.2486464Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2486696Z self=, 2025-05-07T20:32:25.2486776Z T=16384, 2025-05-07T20:32:25.2486863Z D=7168, 2025-05-07T20:32:25.2486952Z scale_ub=1200.0, 2025-05-07T20:32:25.2487045Z contiguous=True, 2025-05-07T20:32:25.2487141Z compiled=True, 2025-05-07T20:32:25.2487217Z ) 2025-05-07T20:32:25.2487434Z self = 2025-05-07T20:32:25.2487624Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.2487631Z 2025-05-07T20:32:25.2487712Z @given( 2025-05-07T20:32:25.2487831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2487940Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2488060Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2488186Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2488301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2488378Z ) 2025-05-07T20:32:25.2488630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2488726Z def test_silu_mul_quant( 2025-05-07T20:32:25.2488808Z self, 2025-05-07T20:32:25.2488897Z T: int, 2025-05-07T20:32:25.2488979Z D: int, 2025-05-07T20:32:25.2489080Z scale_ub: Optional[float], 2025-05-07T20:32:25.2489181Z contiguous: bool, 2025-05-07T20:32:25.2489269Z compiled: bool, 2025-05-07T20:32:25.2489353Z ) -> None: 2025-05-07T20:32:25.2489460Z torch.manual_seed(2025) 2025-05-07T20:32:25.2489540Z 2025-05-07T20:32:25.2489739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2489821Z 2025-05-07T20:32:25.2489933Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2490068Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2490159Z x = x_sign * x_clamp 2025-05-07T20:32:25.2490242Z x0 = x[:, :D] 2025-05-07T20:32:25.2490337Z x1 = x[:, D:] 2025-05-07T20:32:25.2490411Z 2025-05-07T20:32:25.2490499Z if contiguous: 2025-05-07T20:32:25.2490601Z x0 = x0.contiguous() 2025-05-07T20:32:25.2490695Z x1 = x1.contiguous() 2025-05-07T20:32:25.2490771Z 2025-05-07T20:32:25.2490871Z if scale_ub is not None: 2025-05-07T20:32:25.2490978Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2491124Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2491254Z ) 2025-05-07T20:32:25.2491334Z else: 2025-05-07T20:32:25.2491441Z scale_ub_tensor = None 2025-05-07T20:32:25.2491516Z 2025-05-07T20:32:25.2491646Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2491746Z op = silu_mul_quant 2025-05-07T20:32:25.2491834Z if compiled: 2025-05-07T20:32:25.2491936Z op = torch.compile(op) 2025-05-07T20:32:25.2492052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2492128Z 2025-05-07T20:32:25.2492222Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2492226Z 2025-05-07T20:32:25.2492334Z moe/activation_test.py:117: 2025-05-07T20:32:25.2492534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2492682Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2492784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2493190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2493291Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2493782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2493880Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2494243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2494466Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2494813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2494915Z kernel = self.compile( 2025-05-07T20:32:25.2495297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2495478Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2495610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2495614Z 2025-05-07T20:32:25.2495827Z self = 2025-05-07T20:32:25.2496595Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2497093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3493c40>} 2025-05-07T20:32:25.2497844Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2498042Z context = 2025-05-07T20:32:25.2498046Z 2025-05-07T20:32:25.2498219Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2498480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2498587Z module_map=module_map) 2025-05-07T20:32:25.2498754Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2498854Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2498940Z E ^ 2025-05-07T20:32:25.2499293Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2499300Z 2025-05-07T20:32:25.2499714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2499719Z 2025-05-07T20:32:25.2499830Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2500098Z self=, 2025-05-07T20:32:25.2500187Z T=16384, 2025-05-07T20:32:25.2500269Z D=5120, 2025-05-07T20:32:25.2500355Z scale_ub=1200.0, 2025-05-07T20:32:25.2500470Z contiguous=True, 2025-05-07T20:32:25.2500564Z compiled=False, 2025-05-07T20:32:25.2500655Z ) 2025-05-07T20:32:25.2500884Z self = 2025-05-07T20:32:25.2501063Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.2501068Z 2025-05-07T20:32:25.2501146Z @given( 2025-05-07T20:32:25.2501275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2501419Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2501577Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2501701Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2501855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2501942Z ) 2025-05-07T20:32:25.2502188Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2502284Z def test_silu_mul_quant( 2025-05-07T20:32:25.2502369Z self, 2025-05-07T20:32:25.2502448Z T: int, 2025-05-07T20:32:25.2502529Z D: int, 2025-05-07T20:32:25.2502639Z scale_ub: Optional[float], 2025-05-07T20:32:25.2502730Z contiguous: bool, 2025-05-07T20:32:25.2502816Z compiled: bool, 2025-05-07T20:32:25.2502903Z ) -> None: 2025-05-07T20:32:25.2502999Z torch.manual_seed(2025) 2025-05-07T20:32:25.2503073Z 2025-05-07T20:32:25.2503253Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2503335Z 2025-05-07T20:32:25.2503437Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2503564Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2503656Z x = x_sign * x_clamp 2025-05-07T20:32:25.2503753Z x0 = x[:, :D] 2025-05-07T20:32:25.2503840Z x1 = x[:, D:] 2025-05-07T20:32:25.2503914Z 2025-05-07T20:32:25.2504007Z if contiguous: 2025-05-07T20:32:25.2504101Z x0 = x0.contiguous() 2025-05-07T20:32:25.2504192Z x1 = x1.contiguous() 2025-05-07T20:32:25.2504274Z 2025-05-07T20:32:25.2504365Z if scale_ub is not None: 2025-05-07T20:32:25.2504473Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2504618Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2504694Z ) 2025-05-07T20:32:25.2504779Z else: 2025-05-07T20:32:25.2504875Z scale_ub_tensor = None 2025-05-07T20:32:25.2504952Z 2025-05-07T20:32:25.2505087Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2505181Z op = silu_mul_quant 2025-05-07T20:32:25.2505267Z if compiled: 2025-05-07T20:32:25.2505374Z op = torch.compile(op) 2025-05-07T20:32:25.2505484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2505558Z 2025-05-07T20:32:25.2505662Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2505666Z 2025-05-07T20:32:25.2505765Z moe/activation_test.py:117: 2025-05-07T20:32:25.2505894Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2506002Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2506102Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2506603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
        _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
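Every fp8e4nv failure in this run has the same root cause: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, which NVIDIA GPUs implement natively only from compute capability 8.9 (Ada/Hopper) onward. The job's g5.4xlarge runner carries an A10G, which reports compute capability 8.6, so the CUDA backend offers only the fp8e4b15 and fp8e5 encodings named in the message. A minimal sketch of a capability guard that would let the suite skip these examples on such GPUs (the helper name is illustrative, not part of the test file):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv maps to torch.float8_e4m3fn; NVIDIA hardware supports it
        # natively starting with compute capability 8.9 (Ada / Hopper).
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Hypothetical usage on the test method:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")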
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> triton.compiler.errors.CompilationError from _fbgemm_silu_mul_quant at moe/activation_test.py:117
     (ValueError: type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5'));
     with compiled=True the traceback additionally passes through torch/_dynamo/eval_frame.py:678 in _fn
     before reaching fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 and the Triton compile chain above.

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> triton.compiler.errors.CompilationError (type fp8e4nv not supported in this architecture) at moe/activation_test.py:117

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> triton.compiler.errors.CompilationError (type fp8e4nv not supported in this architecture) at moe/activation_test.py:117
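With verbosity=Verbosity.verbose, Hypothesis logs every drawn example before running it, which is why the listing repeats for each parameter combination. To replay a single failing draw without rerunning the whole search, the op can be called directly; a minimal repro sketch, assuming silu_mul_quant is importable from the module path shown in the traceback:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    torch.manual_seed(2025)
    T, D = 4096, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)
    # On a pre-SM-8.9 GPU this raises the CompilationError shown above.
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)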
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 28.44 MiB free

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 140.44 MiB free
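The requested sizes match the test's input tensor exactly: x = torch.randn([T, 2 * D]) in bfloat16 occupies T * 2D * 2 bytes, and each of sign/abs/clamp materializes another tensor of the same size, so the largest shapes fail as soon as the 22.07 GiB device is nearly full. A quick check of the arithmetic:

    # Size in MiB of x = torch.randn([T, 2 * D], dtype=torch.bfloat16),
    # at 2 bytes per bfloat16 element.
    def x_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20

    assert x_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB" (T=16384, D=5120)
    assert x_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB" (T=16384, D=7168)
    assert x_mib(4096, 5120) == 80.0    # "Tried to allocate 80.00 MiB"  (T=4096,  D=5120)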
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 28.44 MiB free

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 56.00 MiB with 28.44 MiB free
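With only tens of MiB free on a 22.07 GiB device even for 40-56 MiB requests, the pattern suggests allocations surviving across Hypothesis examples rather than any single oversized request. Besides the allocator's own suggestion of exporting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, a per-example cleanup hook is a plausible mitigation; a minimal sketch, assuming one CUDA context is shared by all examples:

    import gc
    import torch

    def free_cuda_between_examples() -> None:
        gc.collect()              # drop dead Python references first
        torch.cuda.empty_cache()  # return cached allocator blocks to the driver
        torch.cuda.synchronize()  # ensure the frees have completed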
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError (type fp8e4nv not supported in this architecture) at moe/activation_test.py:117

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError (type fp8e4nv not supported in this architecture) at moe/activation_test.py:117

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError (type fp8e4nv not supported in this architecture) at moe/activation_test.py:117
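The test only pins down silu_mul_quant's interface: two bfloat16 views x0 and x1 of shape [T, D], an optional float32 scale_ub tensor, and a (y_fp8, y_scale) pair coming back. For reference while reading these traces, here is a hedged eager-mode sketch of that contract; the SiLU-multiply is implied by the op's name, while the rowwise-scaling details are an assumption, not FBGEMM's actual kernel semantics:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Compute silu(x0) * x1 in fp32, then quantize rowwise to FP8 (e4m3).
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        y_scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale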
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2637022Z 2025-05-07T20:32:25.2637434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2637445Z 2025-05-07T20:32:25.2637550Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2637771Z self=, 2025-05-07T20:32:25.2637899Z T=2048, 2025-05-07T20:32:25.2638017Z D=7168, 2025-05-07T20:32:25.2638102Z scale_ub=1200.0, 2025-05-07T20:32:25.2638195Z contiguous=True, 2025-05-07T20:32:25.2638281Z compiled=False, 2025-05-07T20:32:25.2638354Z ) 2025-05-07T20:32:25.2638619Z self = 2025-05-07T20:32:25.2638800Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.2638807Z 2025-05-07T20:32:25.2638924Z @given( 2025-05-07T20:32:25.2639084Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2639224Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2639391Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2639576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2639754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2639863Z ) 2025-05-07T20:32:25.2640195Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2640296Z def test_silu_mul_quant( 2025-05-07T20:32:25.2640378Z self, 2025-05-07T20:32:25.2640455Z T: int, 2025-05-07T20:32:25.2640531Z D: int, 2025-05-07T20:32:25.2640636Z scale_ub: Optional[float], 2025-05-07T20:32:25.2640730Z contiguous: bool, 2025-05-07T20:32:25.2640824Z compiled: bool, 2025-05-07T20:32:25.2640903Z ) -> None: 2025-05-07T20:32:25.2640997Z torch.manual_seed(2025) 2025-05-07T20:32:25.2641073Z 2025-05-07T20:32:25.2641238Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2643003Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2643020Z 2025-05-07T20:32:25.2643141Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2643147Z 2025-05-07T20:32:25.2643248Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2643476Z self=, 2025-05-07T20:32:25.2643553Z T=1, 2025-05-07T20:32:25.2643630Z D=5120, 2025-05-07T20:32:25.2643718Z scale_ub=1200.0, 2025-05-07T20:32:25.2643802Z contiguous=True, 2025-05-07T20:32:25.2643894Z compiled=False, 2025-05-07T20:32:25.2643967Z ) 2025-05-07T20:32:25.2644182Z self = 2025-05-07T20:32:25.2644350Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.2644357Z 2025-05-07T20:32:25.2644433Z @given( 2025-05-07T20:32:25.2644551Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2644656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2644769Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2644946Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2645066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2645141Z ) 2025-05-07T20:32:25.2645389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2645482Z def test_silu_mul_quant( 2025-05-07T20:32:25.2645557Z self, 2025-05-07T20:32:25.2645638Z T: int, 2025-05-07T20:32:25.2645714Z D: int, 2025-05-07T20:32:25.2645812Z scale_ub: Optional[float], 2025-05-07T20:32:25.2645904Z contiguous: bool, 2025-05-07T20:32:25.2645989Z compiled: bool, 2025-05-07T20:32:25.2646069Z ) -> None: 2025-05-07T20:32:25.2646173Z torch.manual_seed(2025) 2025-05-07T20:32:25.2646330Z 2025-05-07T20:32:25.2646494Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2646573Z 2025-05-07T20:32:25.2646663Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2646831Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2646921Z x = x_sign * x_clamp 2025-05-07T20:32:25.2647001Z x0 = x[:, :D] 2025-05-07T20:32:25.2647089Z x1 = x[:, D:] 2025-05-07T20:32:25.2647160Z 2025-05-07T20:32:25.2647242Z if contiguous: 2025-05-07T20:32:25.2647342Z x0 = x0.contiguous() 2025-05-07T20:32:25.2647430Z x1 = x1.contiguous() 2025-05-07T20:32:25.2647501Z 2025-05-07T20:32:25.2647600Z if scale_ub is not None: 2025-05-07T20:32:25.2647706Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2647839Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2647921Z ) 2025-05-07T20:32:25.2648001Z else: 2025-05-07T20:32:25.2648098Z scale_ub_tensor = None 2025-05-07T20:32:25.2648176Z 2025-05-07T20:32:25.2648305Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2648403Z op = silu_mul_quant 2025-05-07T20:32:25.2648492Z if compiled: 2025-05-07T20:32:25.2648596Z op = torch.compile(op) 2025-05-07T20:32:25.2648707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2648783Z 2025-05-07T20:32:25.2648877Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2648881Z 2025-05-07T20:32:25.2648985Z moe/activation_test.py:117: 2025-05-07T20:32:25.2649113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2649214Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2649319Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2649844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2649974Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2650329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2650552Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2650896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2650989Z kernel = self.compile( 2025-05-07T20:32:25.2651377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2651548Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2651676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2651681Z 2025-05-07T20:32:25.2651890Z self = 2025-05-07T20:32:25.2652668Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2653253Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca2b639c0>} 2025-05-07T20:32:25.2653999Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2654192Z context = 2025-05-07T20:32:25.2654196Z 2025-05-07T20:32:25.2654363Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2654623Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2654811Z module_map=module_map) 2025-05-07T20:32:25.2654970Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2655105Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2655192Z E ^ 2025-05-07T20:32:25.2655544Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2655549Z 2025-05-07T20:32:25.2655959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2655969Z 2025-05-07T20:32:25.2656074Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2656294Z self=, 2025-05-07T20:32:25.2656376Z T=2048, 2025-05-07T20:32:25.2656451Z D=5120, 2025-05-07T20:32:25.2656532Z scale_ub=None, 2025-05-07T20:32:25.2656629Z contiguous=True, 2025-05-07T20:32:25.2656713Z compiled=False, 2025-05-07T20:32:25.2656784Z ) 2025-05-07T20:32:25.2657008Z self = 2025-05-07T20:32:25.2657184Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.2657191Z 2025-05-07T20:32:25.2657272Z @given( 2025-05-07T20:32:25.2657395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2657500Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2657614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2657730Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2657847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2657924Z ) 2025-05-07T20:32:25.2658165Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2658266Z def test_silu_mul_quant( 2025-05-07T20:32:25.2658343Z self, 2025-05-07T20:32:25.2658421Z T: int, 2025-05-07T20:32:25.2658507Z D: int, 2025-05-07T20:32:25.2658604Z scale_ub: Optional[float], 2025-05-07T20:32:25.2658693Z contiguous: bool, 2025-05-07T20:32:25.2658786Z compiled: bool, 2025-05-07T20:32:25.2658866Z ) -> None: 2025-05-07T20:32:25.2658969Z torch.manual_seed(2025) 2025-05-07T20:32:25.2659042Z 2025-05-07T20:32:25.2659206Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2659287Z 2025-05-07T20:32:25.2659377Z > x_sign = torch.sign(x) 2025-05-07T20:32:25.2661141Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2661161Z 2025-05-07T20:32:25.2661279Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:25.2661287Z 2025-05-07T20:32:25.2661431Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2661658Z self=, 2025-05-07T20:32:25.2661737Z T=16384, 2025-05-07T20:32:25.2661814Z D=5120, 2025-05-07T20:32:25.2661904Z scale_ub=None, 2025-05-07T20:32:25.2661988Z contiguous=True, 2025-05-07T20:32:25.2662078Z compiled=False, 2025-05-07T20:32:25.2662150Z ) 2025-05-07T20:32:25.2662365Z self = 2025-05-07T20:32:25.2662546Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.2662550Z 2025-05-07T20:32:25.2662628Z @given( 2025-05-07T20:32:25.2662785Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2662931Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2663043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2663197Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2663318Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2663391Z ) 2025-05-07T20:32:25.2663642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2663735Z def test_silu_mul_quant( 2025-05-07T20:32:25.2663810Z self, 2025-05-07T20:32:25.2663890Z T: int, 2025-05-07T20:32:25.2663965Z D: int, 2025-05-07T20:32:25.2664059Z scale_ub: Optional[float], 2025-05-07T20:32:25.2664154Z contiguous: bool, 2025-05-07T20:32:25.2664238Z compiled: bool, 2025-05-07T20:32:25.2664315Z ) -> None: 2025-05-07T20:32:25.2664416Z torch.manual_seed(2025) 2025-05-07T20:32:25.2664491Z 2025-05-07T20:32:25.2664658Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2666431Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2666437Z 2025-05-07T20:32:25.2666553Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2666564Z 2025-05-07T20:32:25.2666664Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2666881Z self=, 2025-05-07T20:32:25.2666967Z T=4096, 2025-05-07T20:32:25.2667044Z D=5120, 2025-05-07T20:32:25.2667125Z scale_ub=None, 2025-05-07T20:32:25.2667216Z contiguous=True, 2025-05-07T20:32:25.2667298Z compiled=False, 2025-05-07T20:32:25.2667370Z ) 2025-05-07T20:32:25.2667595Z self = 2025-05-07T20:32:25.2667763Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.2667768Z 2025-05-07T20:32:25.2667851Z @given( 2025-05-07T20:32:25.2667967Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2668063Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2668182Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2668296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2668407Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2668488Z ) 2025-05-07T20:32:25.2668730Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2668827Z def test_silu_mul_quant( 2025-05-07T20:32:25.2668910Z self, 2025-05-07T20:32:25.2668986Z T: int, 2025-05-07T20:32:25.2669123Z D: int, 2025-05-07T20:32:25.2669231Z scale_ub: Optional[float], 2025-05-07T20:32:25.2669367Z contiguous: bool, 2025-05-07T20:32:25.2669460Z compiled: bool, 2025-05-07T20:32:25.2669539Z ) -> None: 2025-05-07T20:32:25.2669632Z torch.manual_seed(2025) 2025-05-07T20:32:25.2669710Z 2025-05-07T20:32:25.2669874Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2671674Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
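On the recurring CompilationError: fp8e4nv is Triton's name for the e4m3 FP8 format, which Triton lowers only on GPUs of compute capability 8.9 and newer; on this runner's older card only fp8e4b15 and fp8e5 are available, exactly as the ValueError lists. A capability guard for such tests could look like the following sketch (the helper and decorator names are illustrative, not FBGEMM's actual code):

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton supports fp8e4nv (e4m3) only on compute capability 8.9+.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for skipping FP8 kernel tests on older GPUs.
    requires_fp8e4nv = unittest.skipUnless(
        _supports_fp8e4nv(), "fp8e4nv needs an sm_89+ GPU"
    )
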
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2671804Z 2025-05-07T20:32:25.2671924Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2671928Z 2025-05-07T20:32:25.2672029Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2672254Z self=, 2025-05-07T20:32:25.2672330Z T=2048, 2025-05-07T20:32:25.2672407Z D=5120, 2025-05-07T20:32:25.2672494Z scale_ub=None, 2025-05-07T20:32:25.2672582Z contiguous=False, 2025-05-07T20:32:25.2672671Z compiled=False, 2025-05-07T20:32:25.2672742Z ) 2025-05-07T20:32:25.2672956Z self = 2025-05-07T20:32:25.2673132Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.2673139Z 2025-05-07T20:32:25.2673219Z @given( 2025-05-07T20:32:25.2673333Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2673441Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2673559Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2673675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2673794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2673869Z ) 2025-05-07T20:32:25.2674117Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2674211Z def test_silu_mul_quant( 2025-05-07T20:32:25.2674287Z self, 2025-05-07T20:32:25.2674370Z T: int, 2025-05-07T20:32:25.2674445Z D: int, 2025-05-07T20:32:25.2674542Z scale_ub: Optional[float], 2025-05-07T20:32:25.2674642Z contiguous: bool, 2025-05-07T20:32:25.2674728Z compiled: bool, 2025-05-07T20:32:25.2674808Z ) -> None: 2025-05-07T20:32:25.2674916Z torch.manual_seed(2025) 2025-05-07T20:32:25.2674992Z 2025-05-07T20:32:25.2675157Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2676920Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2676926Z 2025-05-07T20:32:25.2677042Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2677053Z 2025-05-07T20:32:25.2677154Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2677375Z self=, 2025-05-07T20:32:25.2677464Z T=4096, 2025-05-07T20:32:25.2677541Z D=7168, 2025-05-07T20:32:25.2677624Z scale_ub=None, 2025-05-07T20:32:25.2677715Z contiguous=True, 2025-05-07T20:32:25.2677799Z compiled=True, 2025-05-07T20:32:25.2677919Z ) 2025-05-07T20:32:25.2678142Z self = 2025-05-07T20:32:25.2678311Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.2678316Z 2025-05-07T20:32:25.2678399Z @given( 2025-05-07T20:32:25.2678514Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2678611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2678731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2678850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2678962Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2679082Z ) 2025-05-07T20:32:25.2679361Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2679453Z def test_silu_mul_quant( 2025-05-07T20:32:25.2679538Z self, 2025-05-07T20:32:25.2679696Z T: int, 2025-05-07T20:32:25.2679789Z D: int, 2025-05-07T20:32:25.2679904Z scale_ub: Optional[float], 2025-05-07T20:32:25.2680011Z contiguous: bool, 2025-05-07T20:32:25.2680106Z compiled: bool, 2025-05-07T20:32:25.2680182Z ) -> None: 2025-05-07T20:32:25.2680275Z torch.manual_seed(2025) 2025-05-07T20:32:25.2680355Z 2025-05-07T20:32:25.2680518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2682277Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2682296Z 2025-05-07T20:32:25.2682413Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2682417Z 2025-05-07T20:32:25.2682517Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2682740Z self=, 2025-05-07T20:32:25.2682816Z T=2048, 2025-05-07T20:32:25.2682893Z D=5120, 2025-05-07T20:32:25.2682980Z scale_ub=1200.0, 2025-05-07T20:32:25.2683069Z contiguous=False, 2025-05-07T20:32:25.2683157Z compiled=False, 2025-05-07T20:32:25.2683230Z ) 2025-05-07T20:32:25.2683445Z self = 2025-05-07T20:32:25.2683627Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.2683635Z 2025-05-07T20:32:25.2683711Z @given( 2025-05-07T20:32:25.2683828Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2683935Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2684050Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2684166Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2684282Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2684355Z ) 2025-05-07T20:32:25.2684602Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2684695Z def test_silu_mul_quant( 2025-05-07T20:32:25.2684771Z self, 2025-05-07T20:32:25.2684852Z T: int, 2025-05-07T20:32:25.2684929Z D: int, 2025-05-07T20:32:25.2685026Z scale_ub: Optional[float], 2025-05-07T20:32:25.2685123Z contiguous: bool, 2025-05-07T20:32:25.2685209Z compiled: bool, 2025-05-07T20:32:25.2685292Z ) -> None: 2025-05-07T20:32:25.2685397Z torch.manual_seed(2025) 2025-05-07T20:32:25.2685468Z 2025-05-07T20:32:25.2685632Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2687433Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2687440Z 2025-05-07T20:32:25.2687555Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2687604Z 2025-05-07T20:32:25.2687706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2687961Z self=, 2025-05-07T20:32:25.2688043Z T=4096, 2025-05-07T20:32:25.2688120Z D=7168, 2025-05-07T20:32:25.2688239Z scale_ub=1200.0, 2025-05-07T20:32:25.2688333Z contiguous=True, 2025-05-07T20:32:25.2688415Z compiled=False, 2025-05-07T20:32:25.2688486Z ) 2025-05-07T20:32:25.2688707Z self = 2025-05-07T20:32:25.2688878Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.2688882Z 2025-05-07T20:32:25.2688966Z @given( 2025-05-07T20:32:25.2689081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2689178Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2689298Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2689414Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2689528Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2689611Z ) 2025-05-07T20:32:25.2689855Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2689951Z def test_silu_mul_quant( 2025-05-07T20:32:25.2690033Z self, 2025-05-07T20:32:25.2690114Z T: int, 2025-05-07T20:32:25.2690191Z D: int, 2025-05-07T20:32:25.2690294Z scale_ub: Optional[float], 2025-05-07T20:32:25.2690384Z contiguous: bool, 2025-05-07T20:32:25.2690474Z compiled: bool, 2025-05-07T20:32:25.2690553Z ) -> None: 2025-05-07T20:32:25.2690647Z torch.manual_seed(2025) 2025-05-07T20:32:25.2690726Z 2025-05-07T20:32:25.2690888Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2692643Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2692659Z 2025-05-07T20:32:25.2692774Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2692779Z 2025-05-07T20:32:25.2692879Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2693104Z self=, 2025-05-07T20:32:25.2693183Z T=16384, 2025-05-07T20:32:25.2693262Z D=7168, 2025-05-07T20:32:25.2693352Z scale_ub=None, 2025-05-07T20:32:25.2693438Z contiguous=False, 2025-05-07T20:32:25.2693527Z compiled=True, 2025-05-07T20:32:25.2693600Z ) 2025-05-07T20:32:25.2693814Z self = 2025-05-07T20:32:25.2693998Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:25.2694002Z 2025-05-07T20:32:25.2694081Z @given( 2025-05-07T20:32:25.2694198Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2694347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2694464Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2694577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2694694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2694766Z ) 2025-05-07T20:32:25.2695012Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2695107Z def test_silu_mul_quant( 2025-05-07T20:32:25.2695182Z self, 2025-05-07T20:32:25.2695264Z T: int, 2025-05-07T20:32:25.2695342Z D: int, 2025-05-07T20:32:25.2695439Z scale_ub: Optional[float], 2025-05-07T20:32:25.2695617Z contiguous: bool, 2025-05-07T20:32:25.2695702Z compiled: bool, 2025-05-07T20:32:25.2695779Z ) -> None: 2025-05-07T20:32:25.2695881Z torch.manual_seed(2025) 2025-05-07T20:32:25.2695953Z 2025-05-07T20:32:25.2696167Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2697925Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
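The requested sizes line up exactly with the bf16 input x of shape [T, 2*D]: at 2 bytes per element that tensor needs T*2*D*2 bytes, so T=16384, D=7168 gives 16384*14336*2 bytes = 448 MiB, matching the allocation above. A quick check (the helper name is ours):

    def bf16_mib(T: int, D: int) -> float:
        # Size of a [T, 2*D] bfloat16 tensor in MiB (2 bytes per element).
        return T * 2 * D * 2 / 2**20

    assert bf16_mib(2048, 5120) == 40.0     # the 40.00 MiB requests above
    assert bf16_mib(16384, 5120) == 320.0   # the 320.00 MiB request
    assert bf16_mib(4096, 7168) == 112.0    # the 112.00 MiB requests
    assert bf16_mib(16384, 7168) == 448.0   # the 448.00 MiB requests
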
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2697931Z 2025-05-07T20:32:25.2698056Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2698063Z 2025-05-07T20:32:25.2698164Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2698381Z self=, 2025-05-07T20:32:25.2698464Z T=4096, 2025-05-07T20:32:25.2698542Z D=7168, 2025-05-07T20:32:25.2698623Z scale_ub=None, 2025-05-07T20:32:25.2698713Z contiguous=True, 2025-05-07T20:32:25.2698795Z compiled=False, 2025-05-07T20:32:25.2698866Z ) 2025-05-07T20:32:25.2699086Z self = 2025-05-07T20:32:25.2699253Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.2699258Z 2025-05-07T20:32:25.2699342Z @given( 2025-05-07T20:32:25.2699460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2699572Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2699704Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2699839Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2699954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2700033Z ) 2025-05-07T20:32:25.2700279Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2700375Z def test_silu_mul_quant( 2025-05-07T20:32:25.2700458Z self, 2025-05-07T20:32:25.2700534Z T: int, 2025-05-07T20:32:25.2700609Z D: int, 2025-05-07T20:32:25.2700713Z scale_ub: Optional[float], 2025-05-07T20:32:25.2700800Z contiguous: bool, 2025-05-07T20:32:25.2700892Z compiled: bool, 2025-05-07T20:32:25.2700969Z ) -> None: 2025-05-07T20:32:25.2701062Z torch.manual_seed(2025) 2025-05-07T20:32:25.2701139Z 2025-05-07T20:32:25.2701303Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2703099Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2703119Z 2025-05-07T20:32:25.2703235Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2703240Z 2025-05-07T20:32:25.2703341Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2703565Z self=, 2025-05-07T20:32:25.2703643Z T=16384, 2025-05-07T20:32:25.2703722Z D=7168, 2025-05-07T20:32:25.2703808Z scale_ub=None, 2025-05-07T20:32:25.2703894Z contiguous=True, 2025-05-07T20:32:25.2703987Z compiled=False, 2025-05-07T20:32:25.2704100Z ) 2025-05-07T20:32:25.2704354Z self = 2025-05-07T20:32:25.2704533Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.2704537Z 2025-05-07T20:32:25.2704653Z @given( 2025-05-07T20:32:25.2704773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2704877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2704989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2705104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2705223Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2705296Z ) 2025-05-07T20:32:25.2705545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2705639Z def test_silu_mul_quant( 2025-05-07T20:32:25.2705715Z self, 2025-05-07T20:32:25.2705797Z T: int, 2025-05-07T20:32:25.2705877Z D: int, 2025-05-07T20:32:25.2705977Z scale_ub: Optional[float], 2025-05-07T20:32:25.2706072Z contiguous: bool, 2025-05-07T20:32:25.2706159Z compiled: bool, 2025-05-07T20:32:25.2706236Z ) -> None: 2025-05-07T20:32:25.2706341Z torch.manual_seed(2025) 2025-05-07T20:32:25.2706419Z 2025-05-07T20:32:25.2706582Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2708343Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2708354Z 2025-05-07T20:32:25.2708476Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2708480Z 2025-05-07T20:32:25.2708581Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2708802Z self=, 2025-05-07T20:32:25.2708888Z T=16384, 2025-05-07T20:32:25.2708965Z D=7168, 2025-05-07T20:32:25.2709049Z scale_ub=1200.0, 2025-05-07T20:32:25.2709192Z contiguous=True, 2025-05-07T20:32:25.2709274Z compiled=False, 2025-05-07T20:32:25.2709347Z ) 2025-05-07T20:32:25.2709567Z self = 2025-05-07T20:32:25.2709742Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.2709747Z 2025-05-07T20:32:25.2709829Z @given( 2025-05-07T20:32:25.2709966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2710073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2710214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2710331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2710440Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2710521Z ) 2025-05-07T20:32:25.2710809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2710905Z def test_silu_mul_quant( 2025-05-07T20:32:25.2710988Z self, 2025-05-07T20:32:25.2711063Z T: int, 2025-05-07T20:32:25.2711138Z D: int, 2025-05-07T20:32:25.2711247Z scale_ub: Optional[float], 2025-05-07T20:32:25.2711337Z contiguous: bool, 2025-05-07T20:32:25.2711429Z compiled: bool, 2025-05-07T20:32:25.2711506Z ) -> None: 2025-05-07T20:32:25.2711599Z torch.manual_seed(2025) 2025-05-07T20:32:25.2711677Z 2025-05-07T20:32:25.2711841Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2713695Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2713743Z 2025-05-07T20:32:25.2713860Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2713865Z 2025-05-07T20:32:25.2713965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2714189Z self=, 2025-05-07T20:32:25.2714266Z T=128, 2025-05-07T20:32:25.2714342Z D=5120, 2025-05-07T20:32:25.2714429Z scale_ub=1200.0, 2025-05-07T20:32:25.2714518Z contiguous=False, 2025-05-07T20:32:25.2714609Z compiled=False, 2025-05-07T20:32:25.2714681Z ) 2025-05-07T20:32:25.2714895Z self = 2025-05-07T20:32:25.2715073Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.2715077Z 2025-05-07T20:32:25.2715157Z @given( 2025-05-07T20:32:25.2715272Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2715377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2715488Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2715605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2715722Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2715795Z ) 2025-05-07T20:32:25.2716044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2716136Z def test_silu_mul_quant( 2025-05-07T20:32:25.2716215Z self, 2025-05-07T20:32:25.2716299Z T: int, 2025-05-07T20:32:25.2716374Z D: int, 2025-05-07T20:32:25.2716471Z scale_ub: Optional[float], 2025-05-07T20:32:25.2716563Z contiguous: bool, 2025-05-07T20:32:25.2716648Z compiled: bool, 2025-05-07T20:32:25.2716726Z ) -> None: 2025-05-07T20:32:25.2716829Z torch.manual_seed(2025) 2025-05-07T20:32:25.2716902Z 2025-05-07T20:32:25.2717067Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2717149Z 2025-05-07T20:32:25.2717240Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2717370Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2717458Z x = x_sign * x_clamp 2025-05-07T20:32:25.2717541Z x0 = x[:, :D] 2025-05-07T20:32:25.2717626Z x1 = x[:, D:] 2025-05-07T20:32:25.2717698Z 2025-05-07T20:32:25.2717782Z if contiguous: 2025-05-07T20:32:25.2717882Z x0 = x0.contiguous() 2025-05-07T20:32:25.2717973Z x1 = x1.contiguous() 2025-05-07T20:32:25.2718047Z 2025-05-07T20:32:25.2718152Z if scale_ub is not None: 2025-05-07T20:32:25.2722534Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2722698Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2722849Z ) 2025-05-07T20:32:25.2722935Z else: 2025-05-07T20:32:25.2723033Z scale_ub_tensor = None 2025-05-07T20:32:25.2723107Z 2025-05-07T20:32:25.2723250Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2723344Z op = silu_mul_quant 2025-05-07T20:32:25.2723435Z if compiled: 2025-05-07T20:32:25.2723544Z op = torch.compile(op) 2025-05-07T20:32:25.2723651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2723725Z 2025-05-07T20:32:25.2723824Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2723829Z 2025-05-07T20:32:25.2723929Z moe/activation_test.py:117: 2025-05-07T20:32:25.2724114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2724255Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2724357Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2724909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2725008Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2725365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2725597Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2725938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2726040Z kernel = self.compile( 2025-05-07T20:32:25.2726423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2726607Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2726749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2726754Z 2025-05-07T20:32:25.2726964Z self = 2025-05-07T20:32:25.2727748Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2728697Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca290a5c0>} 2025-05-07T20:32:25.2729728Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2729936Z context = 2025-05-07T20:32:25.2729942Z 2025-05-07T20:32:25.2730111Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2730384Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2730493Z module_map=module_map) 2025-05-07T20:32:25.2730655Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2730762Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2730842Z E ^ 2025-05-07T20:32:25.2731207Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2731212Z 2025-05-07T20:32:25.2731623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2731630Z 2025-05-07T20:32:25.2731737Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2731967Z self=, 2025-05-07T20:32:25.2732047Z T=2048, 2025-05-07T20:32:25.2732134Z D=7168, 2025-05-07T20:32:25.2732394Z scale_ub=None, 2025-05-07T20:32:25.2732484Z contiguous=False, 2025-05-07T20:32:25.2732579Z compiled=False, 2025-05-07T20:32:25.2732657Z ) 2025-05-07T20:32:25.2732876Z self = 2025-05-07T20:32:25.2733055Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.2733060Z 2025-05-07T20:32:25.2733138Z @given( 2025-05-07T20:32:25.2733255Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2733361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2733477Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2733601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2733913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2733988Z ) 2025-05-07T20:32:25.2734243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2734400Z def test_silu_mul_quant( 2025-05-07T20:32:25.2734482Z self, 2025-05-07T20:32:25.2734569Z T: int, 2025-05-07T20:32:25.2734650Z D: int, 2025-05-07T20:32:25.2734748Z scale_ub: Optional[float], 2025-05-07T20:32:25.2734846Z contiguous: bool, 2025-05-07T20:32:25.2734934Z compiled: bool, 2025-05-07T20:32:25.2735018Z ) -> None: 2025-05-07T20:32:25.2735119Z torch.manual_seed(2025) 2025-05-07T20:32:25.2735196Z 2025-05-07T20:32:25.2735366Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2737149Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
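For orientation, the op under test fuses a SiLU-gated multiply with rowwise FP8 quantization: y = x0 * sigmoid(x0) * x1, after which each row is rescaled into the FP8 range (optionally capped by scale_ub) and cast down, returning the quantized rows plus per-row scales, which is why the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A plain-PyTorch sketch of the quantization step, assuming e4m3 with max normal value 448.0 and an eps guard of our choosing (triton_quantize_fp8_row's exact semantics may differ):

    from typing import Optional, Tuple

    import torch

    FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute maximum, optionally capped by the scale upper bound.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Per-row dequantization scale; eps guard avoids division by zero.
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale
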
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2737160Z 2025-05-07T20:32:25.2737286Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2737290Z 2025-05-07T20:32:25.2737394Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2737614Z self=, 2025-05-07T20:32:25.2737700Z T=128, 2025-05-07T20:32:25.2737777Z D=7168, 2025-05-07T20:32:25.2737862Z scale_ub=1200.0, 2025-05-07T20:32:25.2737956Z contiguous=True, 2025-05-07T20:32:25.2738042Z compiled=True, 2025-05-07T20:32:25.2738120Z ) 2025-05-07T20:32:25.2738344Z self = 2025-05-07T20:32:25.2738513Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.2738517Z 2025-05-07T20:32:25.2738607Z @given( 2025-05-07T20:32:25.2738728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2738828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2738949Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2739066Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2739179Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2739261Z ) 2025-05-07T20:32:25.2739529Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2739642Z def test_silu_mul_quant( 2025-05-07T20:32:25.2739733Z self, 2025-05-07T20:32:25.2739813Z T: int, 2025-05-07T20:32:25.2739897Z D: int, 2025-05-07T20:32:25.2739997Z scale_ub: Optional[float], 2025-05-07T20:32:25.2740111Z contiguous: bool, 2025-05-07T20:32:25.2740245Z compiled: bool, 2025-05-07T20:32:25.2740358Z ) -> None: 2025-05-07T20:32:25.2740490Z torch.manual_seed(2025) 2025-05-07T20:32:25.2740614Z 2025-05-07T20:32:25.2741108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2741603Z 2025-05-07T20:32:25.2741883Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2742267Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2742685Z x = x_sign * x_clamp 2025-05-07T20:32:25.2743041Z x0 = x[:, :D] 2025-05-07T20:32:25.2743366Z x1 = x[:, D:] 2025-05-07T20:32:25.2743699Z 2025-05-07T20:32:25.2743992Z if contiguous: 2025-05-07T20:32:25.2744332Z x0 = x0.contiguous() 2025-05-07T20:32:25.2744749Z x1 = x1.contiguous() 2025-05-07T20:32:25.2745136Z 2025-05-07T20:32:25.2745442Z if scale_ub is not None: 2025-05-07T20:32:25.2745998Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2746541Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2747026Z ) 2025-05-07T20:32:25.2747320Z else: 2025-05-07T20:32:25.2747784Z scale_ub_tensor = None 2025-05-07T20:32:25.2748165Z 2025-05-07T20:32:25.2748519Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2748969Z op = silu_mul_quant 2025-05-07T20:32:25.2749397Z if compiled: 2025-05-07T20:32:25.2749745Z op = torch.compile(op) 2025-05-07T20:32:25.2750051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2750332Z 2025-05-07T20:32:25.2750528Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2750692Z 2025-05-07T20:32:25.2750801Z moe/activation_test.py:117: 2025-05-07T20:32:25.2751091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2751429Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2751714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2752266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2752833Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2753493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2754178Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2754704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2755381Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2756046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2756570Z kernel = self.compile( 2025-05-07T20:32:25.2757113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2757772Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2758172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2758401Z 2025-05-07T20:32:25.2758608Z self = 2025-05-07T20:32:25.2759686Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2761069Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca290aac0>} 2025-05-07T20:32:25.2762404Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2763427Z context = 2025-05-07T20:32:25.2763714Z 2025-05-07T20:32:25.2763955Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2764471Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2764941Z module_map=module_map) 2025-05-07T20:32:25.2765295Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2765648Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2765907Z E ^ 2025-05-07T20:32:25.2766365Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2766817Z 2025-05-07T20:32:25.2767235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2767871Z 2025-05-07T20:32:25.2767977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2768392Z self=, 2025-05-07T20:32:25.2768832Z T=128, 2025-05-07T20:32:25.2769028Z D=7168, 2025-05-07T20:32:25.2769228Z scale_ub=1200.0, 2025-05-07T20:32:25.2769447Z contiguous=True, 2025-05-07T20:32:25.2769671Z compiled=False, 2025-05-07T20:32:25.2769876Z ) 2025-05-07T20:32:25.2770195Z self = 2025-05-07T20:32:25.2770677Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.2770950Z 2025-05-07T20:32:25.2771029Z @given( 2025-05-07T20:32:25.2771262Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2771567Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2771875Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2772205Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2772522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2772801Z ) 2025-05-07T20:32:25.2773154Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2773596Z def test_silu_mul_quant( 2025-05-07T20:32:25.2773828Z self, 2025-05-07T20:32:25.2774027Z T: int, 2025-05-07T20:32:25.2774222Z D: int, 2025-05-07T20:32:25.2774433Z scale_ub: Optional[float], 2025-05-07T20:32:25.2774699Z contiguous: bool, 2025-05-07T20:32:25.2774936Z compiled: bool, 2025-05-07T20:32:25.2775149Z ) -> None: 2025-05-07T20:32:25.2775366Z torch.manual_seed(2025) 2025-05-07T20:32:25.2775605Z 2025-05-07T20:32:25.2775870Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2776207Z 2025-05-07T20:32:25.2776400Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2776685Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2778679Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
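The compiled=True examples fail the same way as the eager ones: the traceback merely gains a torch/_dynamo/eval_frame.py frame before landing in the identical _fbgemm_silu_mul_quant launch, because torch.compile falls through to the same Triton kernel. In miniature (an illustrative wrapper, not the test's exact code):

    import torch

    def dispatch(op, *args, compiled: bool = False):
        # With compiled=True, dynamo wraps op but ultimately invokes the same
        # underlying kernel, so kernel-side errors reproduce in both modes.
        if compiled:
            op = torch.compile(op)
        return op(*args)
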
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2780535Z 2025-05-07T20:32:25.2780654Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.2780869Z 2025-05-07T20:32:25.2780974Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2781378Z self=, 2025-05-07T20:32:25.2781773Z T=128, 2025-05-07T20:32:25.2781963Z D=5120, 2025-05-07T20:32:25.2782154Z scale_ub=1200.0, 2025-05-07T20:32:25.2782368Z contiguous=True, 2025-05-07T20:32:25.2782590Z compiled=True, 2025-05-07T20:32:25.2782793Z ) 2025-05-07T20:32:25.2783161Z self = 2025-05-07T20:32:25.2783645Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.2783916Z 2025-05-07T20:32:25.2783995Z @given( 2025-05-07T20:32:25.2784225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2784529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2784837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2785162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2785481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2785769Z ) 2025-05-07T20:32:25.2786115Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2786630Z def test_silu_mul_quant( 2025-05-07T20:32:25.2786869Z self, 2025-05-07T20:32:25.2787065Z T: int, 2025-05-07T20:32:25.2787257Z D: int, 2025-05-07T20:32:25.2787516Z scale_ub: Optional[float], 2025-05-07T20:32:25.2787791Z contiguous: bool, 2025-05-07T20:32:25.2788023Z compiled: bool, 2025-05-07T20:32:25.2788244Z ) -> None: 2025-05-07T20:32:25.2788458Z torch.manual_seed(2025) 2025-05-07T20:32:25.2788695Z 2025-05-07T20:32:25.2788959Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2789371Z 2025-05-07T20:32:25.2789565Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2789845Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2791829Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
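The sweep itself is driven by Hypothesis: @given draws (T, D, scale_ub, contiguous, compiled) from sampled_from strategies, and the session banner below reports a derandomized 'ci' profile, which keeps example order stable across the --lf rerun. Such a profile is registered once, along the lines of this sketch (the suite's actual conftest may differ):

    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,      # no example database on CI runners
        derandomize=True,   # stable example order, reproducible reruns
        deadline=None,      # first call may pay kernel-compilation cost
        print_blob=True,    # emit a reproduction blob on failure
        suppress_health_check=(HealthCheck.too_slow,),
    )
    settings.load_profile("ci")
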
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2793694Z 2025-05-07T20:32:25.2793812Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.2794024Z 2025-05-07T20:32:25.2794134Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2794538Z self=, 2025-05-07T20:32:25.2794949Z T=128, 2025-05-07T20:32:25.2795133Z D=7168, 2025-05-07T20:32:25.2795325Z scale_ub=None, 2025-05-07T20:32:25.2795543Z contiguous=True, 2025-05-07T20:32:25.2795760Z compiled=True, 2025-05-07T20:32:25.2795962Z ) 2025-05-07T20:32:25.2796281Z self = 2025-05-07T20:32:25.2796761Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.2797029Z 2025-05-07T20:32:25.2797107Z @given( 2025-05-07T20:32:25.2797344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2797654Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2797964Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2798293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2798619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2798894Z ) 2025-05-07T20:32:25.2799242Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2799680Z def test_silu_mul_quant( 2025-05-07T20:32:25.2799915Z self, 2025-05-07T20:32:25.2800110Z T: int, 2025-05-07T20:32:25.2800302Z D: int, 2025-05-07T20:32:25.2800509Z scale_ub: Optional[float], 2025-05-07T20:32:25.2800778Z contiguous: bool, 2025-05-07T20:32:25.2801018Z compiled: bool, 2025-05-07T20:32:25.2801232Z ) -> None: 2025-05-07T20:32:25.2801449Z torch.manual_seed(2025) 2025-05-07T20:32:25.2801686Z 2025-05-07T20:32:25.2802006Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2804028Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2805890Z 2025-05-07T20:32:25.2806009Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2806436Z =============================== warnings summary =============================== 2025-05-07T20:32:25.2806975Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:25.2807706Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:25.2808405Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:25.2809722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:25.2810913Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:25.2811242Z 2025-05-07T20:32:25.2811450Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:25.2811929Z ================= 1 failed, 1 deselected, 3 warnings in 13.85s ================= 2025-05-07T20:32:26.8576161Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:26.9198054Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:26.9198320Z 2025-05-07T20:32:28.9214086Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:31.0620323Z ============================= test session starts ============================== 2025-05-07T20:32:31.0622033Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:31.0623053Z cachedir: .pytest_cache 2025-05-07T20:32:31.0623729Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:31.0624458Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:31.0624862Z plugins: hypothesis-6.131.14 2025-05-07T20:32:32.6777003Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:32.8304038Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:32.8304609Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:32.8304907Z 2025-05-07T20:32:35.1971696Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.1973265Z self=, 2025-05-07T20:32:35.1973716Z T=1, 2025-05-07T20:32:35.1973901Z D=5120, 2025-05-07T20:32:35.1974097Z scale_ub=None, 2025-05-07T20:32:35.1974341Z contiguous=True, 2025-05-07T20:32:35.1974567Z compiled=True, 2025-05-07T20:32:35.1974778Z ) 2025-05-07T20:32:35.1975103Z self = 2025-05-07T20:32:35.1975587Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.1976150Z 2025-05-07T20:32:35.1976234Z @given( 2025-05-07T20:32:35.1976474Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.1976781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.1977089Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.1977419Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.1977746Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.1978026Z ) 2025-05-07T20:32:35.1978377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.1978816Z def test_silu_mul_quant( 2025-05-07T20:32:35.1979146Z self, 2025-05-07T20:32:35.1979431Z T: int, 2025-05-07T20:32:35.1979638Z D: int, 2025-05-07T20:32:35.1979857Z scale_ub: Optional[float], 2025-05-07T20:32:35.1980130Z contiguous: bool, 2025-05-07T20:32:35.1980373Z compiled: bool, 2025-05-07T20:32:35.1980675Z ) -> None: 2025-05-07T20:32:35.1980903Z torch.manual_seed(2025) 2025-05-07T20:32:35.1981151Z 2025-05-07T20:32:35.1981421Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.1981763Z 2025-05-07T20:32:35.1981963Z x_sign = torch.sign(x) 2025-05-07T20:32:35.1982250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:35.1982564Z x = x_sign * x_clamp 2025-05-07T20:32:35.1982808Z x0 = x[:, :D] 2025-05-07T20:32:35.1983027Z x1 = x[:, D:] 2025-05-07T20:32:35.1983231Z 2025-05-07T20:32:35.1983431Z if contiguous: 2025-05-07T20:32:35.1983708Z x0 = x0.contiguous() 2025-05-07T20:32:35.1983969Z x1 = x1.contiguous() 2025-05-07T20:32:35.1984215Z 2025-05-07T20:32:35.1984411Z if scale_ub is not None: 2025-05-07T20:32:35.1984680Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.1985021Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.1985334Z ) 2025-05-07T20:32:35.1985525Z else: 2025-05-07T20:32:35.1985737Z scale_ub_tensor = None 2025-05-07T20:32:35.1985991Z 2025-05-07T20:32:35.1986223Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.1986537Z op = silu_mul_quant 2025-05-07T20:32:35.1986791Z if compiled: 2025-05-07T20:32:35.1987039Z op = torch.compile(op) 2025-05-07T20:32:35.1987338Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.1987613Z 2025-05-07T20:32:35.1987808Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.1988086Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.1988382Z 2025-05-07T20:32:35.1988625Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.1988951Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.1989329Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.1989652Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.1990004Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.1990316Z 2025-05-07T20:32:35.1990521Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.1990715Z 2025-05-07T20:32:35.1990816Z moe/activation_test.py:126: 2025-05-07T20:32:35.1991113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.1991451Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.1991780Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.1992565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.1993379Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.1993924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.1994657Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.1995340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.1996068Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.1996818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.1997556Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.1998287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.1999011Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.1999612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.2000173Z fn() 2025-05-07T20:32:35.2000693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.2001281Z self.fn.run( 
2025-05-07T20:32:35.2001741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.2002272Z     kernel = self.compile(
2025-05-07T20:32:35.2002822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.2003522Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.2003923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.2004167Z 
2025-05-07T20:32:35.2004374Z self = <...>
2025-05-07T20:32:35.2005458Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.2006836Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f9831ba1260>}
2025-05-07T20:32:35.2008171Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:35.2009179Z context = <...>
2025-05-07T20:32:35.2009467Z 
2025-05-07T20:32:35.2009637Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.2010158Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:35.2010618Z             module_map=module_map)
2025-05-07T20:32:35.2010985Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.2011343Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:35.2011613Z E       ^
2025-05-07T20:32:35.2012074Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.2012527Z 
2025-05-07T20:32:35.2012940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.2013448Z 
2025-05-07T20:32:35.2013561Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:35.2014022Z     self=<...>,
2025-05-07T20:32:35.2014421Z     T=2048,
2025-05-07T20:32:35.2014624Z     D=5120,
2025-05-07T20:32:35.2014825Z     scale_ub=1200.0,
2025-05-07T20:32:35.2015044Z     contiguous=True,
2025-05-07T20:32:35.2015272Z     compiled=False,
2025-05-07T20:32:35.2015478Z )
2025-05-07T20:32:36.1327739Z self = <...>
2025-05-07T20:32:36.1328831Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:36.1329211Z 
2025-05-07T20:32:36.1335417Z     @given(
2025-05-07T20:32:36.1335796Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:36.1336236Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:36.1336584Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:36.1336920Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:36.1337250Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:36.1337533Z     )
2025-05-07T20:32:36.1337884Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:36.1338576Z     def test_silu_mul_quant(
2025-05-07T20:32:36.1338826Z         self,
2025-05-07T20:32:36.1339024Z         T: int,
2025-05-07T20:32:36.1339227Z         D: int,
2025-05-07T20:32:36.1339520Z         scale_ub: Optional[float],
2025-05-07T20:32:36.1339794Z         contiguous: bool,
2025-05-07T20:32:36.1340037Z         compiled: bool,
2025-05-07T20:32:36.1340271Z     ) -> None:
2025-05-07T20:32:36.1340490Z         torch.manual_seed(2025)
2025-05-07T20:32:36.1340735Z 
2025-05-07T20:32:36.1341015Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:36.1341353Z 
2025-05-07T20:32:36.1341553Z         x_sign = torch.sign(x)
2025-05-07T20:32:36.1341849Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:36.1342153Z         x = x_sign * x_clamp
2025-05-07T20:32:36.1342397Z         x0 = x[:, :D]
2025-05-07T20:32:36.1342618Z         x1 = x[:, D:]
2025-05-07T20:32:36.1342820Z 
2025-05-07T20:32:36.1343016Z         if contiguous:
2025-05-07T20:32:36.1343262Z             x0 = x0.contiguous()
2025-05-07T20:32:36.1343553Z             x1 = x1.contiguous()
2025-05-07T20:32:36.1343806Z 
2025-05-07T20:32:36.1344004Z         if scale_ub is not None:
2025-05-07T20:32:36.1344278Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:36.1344615Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:36.1344928Z             )
2025-05-07T20:32:36.1345126Z         else:
2025-05-07T20:32:36.1345335Z             scale_ub_tensor = None
2025-05-07T20:32:36.1345588Z 
2025-05-07T20:32:36.1345823Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:36.1346128Z             op = silu_mul_quant
2025-05-07T20:32:36.1346384Z             if compiled:
2025-05-07T20:32:36.1346633Z                 op = torch.compile(op)
2025-05-07T20:32:36.1346929Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:36.1347208Z 
2025-05-07T20:32:36.1347406Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:36.1347572Z 
2025-05-07T20:32:36.1347712Z moe/activation_test.py:117: 
2025-05-07T20:32:36.1348006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.1348333Z moe/activation_test.py:115: in fn
2025-05-07T20:32:36.1348623Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:36.1349408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:36.1350102Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:36.1350636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:36.1351324Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:36.1351985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:36.1352513Z     kernel = self.compile(
2025-05-07T20:32:36.1353064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:36.1353757Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:36.1354257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.1354487Z 
2025-05-07T20:32:36.1354694Z self = <...>
2025-05-07T20:32:36.1355776Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:36.1357149Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f983184c180>}
2025-05-07T20:32:36.1358484Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:36.1359616Z context = <...>
2025-05-07T20:32:36.1359903Z 
2025-05-07T20:32:36.1360074Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:36.1360596Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:36.1361068Z             module_map=module_map)
2025-05-07T20:32:36.1361431Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.1361793Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:36.1362059Z E       ^
2025-05-07T20:32:36.1362522Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.1362976Z 
2025-05-07T20:32:36.1363396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:36.1363917Z 
2025-05-07T20:32:36.1364022Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:36.1364439Z     self=<...>,
2025-05-07T20:32:36.1364840Z     T=2048,
2025-05-07T20:32:36.1365038Z     D=5120,
2025-05-07T20:32:36.1365238Z     scale_ub=1200.0,
2025-05-07T20:32:36.1365462Z     contiguous=True,
2025-05-07T20:32:36.1365689Z     compiled=True,
2025-05-07T20:32:36.1365905Z )
2025-05-07T20:32:36.1366229Z self = <...>
2025-05-07T20:32:36.1366711Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:36.1366986Z 
2025-05-07T20:32:36.1367065Z     @given(
2025-05-07T20:32:36.1367308Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:36.1367615Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:36.1367925Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:36.1368257Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:36.1368579Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:36.1368866Z     )
2025-05-07T20:32:36.1369222Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:36.1369666Z     def test_silu_mul_quant(
2025-05-07T20:32:36.1369900Z         self,
2025-05-07T20:32:36.1370101Z         T: int,
2025-05-07T20:32:36.1370301Z         D: int,
2025-05-07T20:32:36.1370514Z         scale_ub: Optional[float],
2025-05-07T20:32:36.1370785Z         contiguous: bool,
2025-05-07T20:32:36.1371026Z         compiled: bool,
2025-05-07T20:32:36.1371243Z     ) -> None:
2025-05-07T20:32:36.1371463Z         torch.manual_seed(2025)
2025-05-07T20:32:36.1371708Z 
2025-05-07T20:32:36.1371975Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:36.1372322Z 
2025-05-07T20:32:36.1372521Z         x_sign = torch.sign(x)
2025-05-07T20:32:36.1372808Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:36.1373123Z         x = x_sign * x_clamp
2025-05-07T20:32:36.1373367Z         x0 = x[:, :D]
2025-05-07T20:32:36.1373590Z         x1 = x[:, D:]
2025-05-07T20:32:36.1373901Z 
2025-05-07T20:32:36.1374122Z         if contiguous:
2025-05-07T20:32:36.1374353Z             x0 = x0.contiguous()
2025-05-07T20:32:36.1374615Z             x1 = x1.contiguous()
2025-05-07T20:32:36.1374860Z 
2025-05-07T20:32:36.1375064Z         if scale_ub is not None:
2025-05-07T20:32:36.1375333Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:36.1375673Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:36.1375990Z             )
2025-05-07T20:32:36.1376181Z         else:
2025-05-07T20:32:36.1376400Z             scale_ub_tensor = None
2025-05-07T20:32:36.1376655Z 
2025-05-07T20:32:36.1376885Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:36.1377299Z             op = silu_mul_quant
2025-05-07T20:32:36.1377563Z             if compiled:
2025-05-07T20:32:36.1377805Z                 op = torch.compile(op)
2025-05-07T20:32:36.1378108Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:36.1378421Z 
2025-05-07T20:32:36.1378625Z         y_fp8, y_scale = fn()
2025-05-07T20:32:36.1378911Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:36.1379208Z 
2025-05-07T20:32:36.1379449Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:36.1379777Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:36.1380072Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:36.1380386Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:36.1380743Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:36.1381051Z 
2025-05-07T20:32:36.1381253Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:36.1381451Z 
2025-05-07T20:32:36.1381557Z moe/activation_test.py:126: 
2025-05-07T20:32:36.1381855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.1382192Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:36.1382517Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:36.1383294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:36.1384064Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:36.1384629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:36.1385304Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:36.1385983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:36.1386701Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:36.1387459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:36.1388207Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:36.1388927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:36.1389641Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:36.1390241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:36.1390754Z     fn()
2025-05-07T20:32:36.1391258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:36.1391839Z     self.fn.run(
2025-05-07T20:32:36.1392307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:36.1392832Z     kernel = self.compile(
2025-05-07T20:32:36.1393373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:36.1394072Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:36.1394473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.1394699Z 
2025-05-07T20:32:36.1394904Z self = <...>
2025-05-07T20:32:36.1395776Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:36.1397353Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f9830943560>}
2025-05-07T20:32:36.1398804Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:36.1399830Z context = <...>
2025-05-07T20:32:36.1400121Z 
2025-05-07T20:32:36.1400286Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:36.1400798Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:36.1401265Z             module_map=module_map)
2025-05-07T20:32:36.1401623Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.1401984Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:36.1402254Z E       ^
2025-05-07T20:32:36.1402712Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.1403170Z 
2025-05-07T20:32:36.1403587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
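Both failing paths above, the fused _fbgemm_silu_mul_quant kernel called from fn() and the _kernel_quantize_fp8_row kernel called from ref_fn() through the autotuner, bottom out in the same Triton check: fp8e4nv (FP8 E4M3) lowers to native instructions only on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on a g5.4xlarge A10G, which reports capability (8, 6), so Triton offers only 'fp8e4b15' and 'fp8e5' and every kernel touching the e4m3 dtype fails in make_ir. A minimal sketch of a capability gate such a test could use; supports_fp8e4nv and the test class name are hypothetical, not FBGEMM's actual skip logic:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton's fp8e4nv (e4m3) needs SM 8.9+;
        # the A10G here reports (8, 6), which is why make_ir raises above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "requires an FP8-e4m3-capable GPU (SM 8.9+)")
    class SiluMulQuantTest(unittest.TestCase):  # hypothetical class name
        ...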
2025-05-07T20:32:36.1404211Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:36.9371870Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:36.9412380Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:37.8678198Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:37.8710027Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> ref_fn() fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:37.9178833Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:38.2173619Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:38.2211022Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:38.6687692Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() fails compiling _kernel_quantize_fp8_row
All nine examples fail in src.make_ir with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").
sanitize_overflow=True) 2025-05-07T20:32:39.0994986Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980790a8e0>} 2025-05-07T20:32:39.0996453Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.0997476Z context = 2025-05-07T20:32:39.0997770Z 2025-05-07T20:32:39.0997943Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.0998464Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.0998926Z module_map=module_map) 2025-05-07T20:32:39.0999292Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.0999650Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:39.0999920Z E ^ 2025-05-07T20:32:39.1000387Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.1000847Z 2025-05-07T20:32:39.1001270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.1001784Z 2025-05-07T20:32:39.1001897Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.1002307Z self=, 2025-05-07T20:32:39.1002710Z T=128, 2025-05-07T20:32:39.1002905Z D=5120, 2025-05-07T20:32:39.1003104Z scale_ub=None, 2025-05-07T20:32:39.1003327Z contiguous=True, 2025-05-07T20:32:39.1003558Z compiled=True, 2025-05-07T20:32:39.1003764Z ) 2025-05-07T20:32:39.7590696Z self = 2025-05-07T20:32:39.7591259Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:39.7591564Z 2025-05-07T20:32:39.7591649Z @given( 2025-05-07T20:32:39.7591900Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7592232Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7592557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7592900Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7593239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7593531Z ) 2025-05-07T20:32:39.7593884Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7594332Z def test_silu_mul_quant( 2025-05-07T20:32:39.7594581Z self, 2025-05-07T20:32:39.7594777Z T: int, 2025-05-07T20:32:39.7595014Z D: int, 2025-05-07T20:32:39.7595241Z scale_ub: Optional[float], 2025-05-07T20:32:39.7595520Z contiguous: bool, 2025-05-07T20:32:39.7595760Z compiled: bool, 2025-05-07T20:32:39.7596004Z ) -> None: 2025-05-07T20:32:39.7596233Z torch.manual_seed(2025) 2025-05-07T20:32:39.7596475Z 2025-05-07T20:32:39.7596759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.7597110Z 2025-05-07T20:32:39.7597310Z x_sign = torch.sign(x) 2025-05-07T20:32:39.7597902Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.7598223Z x = x_sign * x_clamp 2025-05-07T20:32:39.7598469Z x0 = x[:, :D] 2025-05-07T20:32:39.7598706Z x1 = x[:, D:] 2025-05-07T20:32:39.7598930Z 2025-05-07T20:32:39.7599122Z if contiguous: 2025-05-07T20:32:39.7599381Z x0 = x0.contiguous() 2025-05-07T20:32:39.7599680Z x1 = x1.contiguous() 2025-05-07T20:32:39.7599942Z 2025-05-07T20:32:39.7600144Z if scale_ub is not None: 2025-05-07T20:32:39.7600448Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.7600834Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.7601232Z ) 2025-05-07T20:32:39.7601518Z else: 2025-05-07T20:32:39.7601739Z scale_ub_tensor = None 2025-05-07T20:32:39.7601995Z 2025-05-07T20:32:39.7602238Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:39.7602636Z op = silu_mul_quant 2025-05-07T20:32:39.7602892Z if compiled: 2025-05-07T20:32:39.7603151Z op = torch.compile(op) 2025-05-07T20:32:39.7603462Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7603738Z 2025-05-07T20:32:39.7603942Z y_fp8, y_scale = fn() 2025-05-07T20:32:39.7604237Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:39.7604536Z 2025-05-07T20:32:39.7604779Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.7605125Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:39.7605428Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:39.7605746Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:39.7606120Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.7606441Z 2025-05-07T20:32:39.7606644Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:39.7606848Z 2025-05-07T20:32:39.7606958Z moe/activation_test.py:126: 2025-05-07T20:32:39.7607266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7607604Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:39.7607932Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.7608731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:39.7609490Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:39.7610031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.7610718Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.7611417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:39.7612140Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:39.7612883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:39.7613630Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:39.7614355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:39.7614992Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:39.7615585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:39.7616105Z fn() 2025-05-07T20:32:39.7616615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:39.7617194Z self.fn.run( 2025-05-07T20:32:39.7617664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.7618250Z kernel = self.compile( 2025-05-07T20:32:39.7618791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.7619438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.7619838Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7620065Z 2025-05-07T20:32:39.7620282Z self = 2025-05-07T20:32:39.7621356Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.7622858Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9807441b20>} 2025-05-07T20:32:39.7624195Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.7625269Z context = 2025-05-07T20:32:39.7625554Z 2025-05-07T20:32:39.7625728Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.7626248Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.7626710Z module_map=module_map) 2025-05-07T20:32:39.7627080Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.7627442Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:39.7627711Z E ^ 2025-05-07T20:32:39.7628365Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.7628818Z 2025-05-07T20:32:39.7629305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.7629816Z 2025-05-07T20:32:39.7629929Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.7630337Z self=, 2025-05-07T20:32:39.7630743Z T=4096, 2025-05-07T20:32:39.7630938Z D=5120, 2025-05-07T20:32:39.7631131Z scale_ub=None, 2025-05-07T20:32:39.7631359Z contiguous=True, 2025-05-07T20:32:39.7631591Z compiled=True, 2025-05-07T20:32:39.7631799Z ) 2025-05-07T20:32:40.2647889Z self = 2025-05-07T20:32:40.2648481Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:40.2648775Z 2025-05-07T20:32:40.2648913Z @given( 2025-05-07T20:32:40.2649247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.2649672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.2650074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.2650490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.2650831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.2651118Z ) 2025-05-07T20:32:40.2651465Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.2651911Z def test_silu_mul_quant( 2025-05-07T20:32:40.2652159Z self, 2025-05-07T20:32:40.2652370Z T: int, 2025-05-07T20:32:40.2652567Z D: int, 2025-05-07T20:32:40.2652792Z scale_ub: Optional[float], 2025-05-07T20:32:40.2653094Z contiguous: bool, 2025-05-07T20:32:40.2653347Z compiled: bool, 2025-05-07T20:32:40.2653575Z ) -> None: 2025-05-07T20:32:40.2653800Z torch.manual_seed(2025) 2025-05-07T20:32:40.2659683Z 2025-05-07T20:32:40.2660157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.2660523Z 2025-05-07T20:32:40.2660733Z x_sign = torch.sign(x) 2025-05-07T20:32:40.2661028Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.2661351Z x = x_sign * x_clamp 2025-05-07T20:32:40.2661607Z x0 = x[:, :D] 2025-05-07T20:32:40.2661831Z x1 = x[:, D:] 2025-05-07T20:32:40.2662046Z 2025-05-07T20:32:40.2662243Z if contiguous: 2025-05-07T20:32:40.2662478Z x0 = x0.contiguous() 2025-05-07T20:32:40.2662745Z x1 = x1.contiguous() 2025-05-07T20:32:40.2662997Z 2025-05-07T20:32:40.2663200Z if scale_ub is not None: 2025-05-07T20:32:40.2663475Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.2663944Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.2664265Z ) 2025-05-07T20:32:40.2664462Z else: 2025-05-07T20:32:40.2664683Z scale_ub_tensor 
= None 2025-05-07T20:32:40.2665004Z 2025-05-07T20:32:40.2665246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.2665571Z op = silu_mul_quant 2025-05-07T20:32:40.2665830Z if compiled: 2025-05-07T20:32:40.2666078Z op = torch.compile(op) 2025-05-07T20:32:40.2666380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.2666663Z 2025-05-07T20:32:40.2666861Z y_fp8, y_scale = fn() 2025-05-07T20:32:40.2667150Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:40.2667445Z 2025-05-07T20:32:40.2667690Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.2668033Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:40.2668337Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:40.2668662Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:40.2669018Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.2669397Z 2025-05-07T20:32:40.2669615Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:40.2669814Z 2025-05-07T20:32:40.2669918Z moe/activation_test.py:126: 2025-05-07T20:32:40.2670222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.2670564Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.2670895Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.2671681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.2672439Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.2672990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.2673674Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.2674366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.2675096Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.2675851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:40.2676594Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.2677330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.2677982Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.2678590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.2679111Z fn() 2025-05-07T20:32:40.2679626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.2680221Z self.fn.run( 2025-05-07T20:32:40.2680738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.2681276Z kernel = self.compile( 2025-05-07T20:32:40.2681822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.2682489Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.2682889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.2683130Z 2025-05-07T20:32:40.2683339Z self = 2025-05-07T20:32:40.2684430Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:40.2685921Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98075b2700>}
2025-05-07T20:32:40.2687256Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:40.2688285Z context = 
2025-05-07T20:32:40.2688582Z 
2025-05-07T20:32:40.2688751Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:40.2689274Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:40.2689744Z module_map=module_map)
2025-05-07T20:32:40.2690127Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.2690487Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:40.2690754Z E ^
2025-05-07T20:32:40.2691228Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.2691683Z 
2025-05-07T20:32:40.2692099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:40.2692610Z 
2025-05-07T20:32:40.2692724Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.2693133Z self=,
2025-05-07T20:32:40.2693542Z T=16384,
2025-05-07T20:32:40.2693749Z D=5120,
2025-05-07T20:32:40.2693957Z scale_ub=None,
2025-05-07T20:32:40.2694177Z contiguous=True,
2025-05-07T20:32:40.2694414Z compiled=True,
2025-05-07T20:32:40.2694634Z )
2025-05-07T20:32:40.2947124Z W0507 20:32:40.293000 239371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:40.2948375Z W0507 20:32:40.293000 239371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:40.2949743Z W0507 20:32:40.293000 239371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:40.2950727Z W0507 20:32:40.293000 239371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:40.2951840Z W0507 20:32:40.293000 239371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
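The torch._dynamo warning above is separate from the fp8 compilation failures. Hypothesis drives test_silu_mul_quant through different T and contiguous values; torch.compile specializes silu_mul_quant on the input shapes and strides (the last guard failure is the stride change from 5120 to 10240 produced by the non-contiguous slices), and after 8 variants Dynamo hits config.recompile_limit and falls back to eager for that frame. A minimal sketch of one way a test like this could reduce the recompile churn, assuming only the silu_mul_quant import path and call signature shown in the tracebacks (the mark_dynamic calls are an illustration, not part of the test file):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # Mark dim 0 (T) as dynamic so each new T reuses one compiled graph
    # instead of counting toward torch._dynamo.config.recompile_limit.
    # Contiguous vs. strided inputs still compile as separate variants.
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)

    compiled_op = torch.compile(silu_mul_quant)
    y_fp8, y_scale = compiled_op(x0, x1, None)  # scale_ub_tensor=None, as in the test

Calling torch._dynamo.reset() between examples, or raising torch._dynamo.config.recompile_limit, would also silence the warning, at the cost of extra compiles.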
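Every CompilationError in this run, above and in the examples that continue below, has the same root cause: Triton lowers the fp8e4nv type (float8_e4m3fn) only on NVIDIA GPUs of compute capability 8.9 or newer, and the GPU in this job reports support only for ('fp8e4b15', 'fp8e5'), i.e. it is below that threshold. Both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant therefore fail at kernel-compile time, before any test assertion runs. A minimal sketch of a capability guard that would skip these examples on such GPUs (the helper and the class name are illustrative, not taken from moe/activation_test.py):

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv requires compute capability >= 8.9 (Ada/Hopper);
        # older devices only expose 'fp8e4b15' and 'fp8e5'.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8_e4m3(), "fp8e4nv needs SM 8.9+")
    class ActivationFp8Tests(unittest.TestCase):  # hypothetical name
        ...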
2025-05-07T20:32:40.3629131Z self = 2025-05-07T20:32:40.3629672Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:40.3629949Z 2025-05-07T20:32:40.3630149Z @given( 2025-05-07T20:32:40.3630386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.3630708Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.3631018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.3631352Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.3631684Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.3631970Z ) 2025-05-07T20:32:40.3632327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.3632769Z def test_silu_mul_quant( 2025-05-07T20:32:40.3633023Z self, 2025-05-07T20:32:40.3633233Z T: int, 2025-05-07T20:32:40.3633499Z D: int, 2025-05-07T20:32:40.3633829Z scale_ub: Optional[float], 2025-05-07T20:32:40.3634112Z contiguous: bool, 2025-05-07T20:32:40.3634353Z compiled: bool, 2025-05-07T20:32:40.3634584Z ) -> None: 2025-05-07T20:32:40.3634862Z torch.manual_seed(2025) 2025-05-07T20:32:40.3635112Z 2025-05-07T20:32:40.3635399Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.3635752Z 2025-05-07T20:32:40.3635960Z x_sign = torch.sign(x) 2025-05-07T20:32:40.3636254Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.3636568Z x = x_sign * x_clamp 2025-05-07T20:32:40.3636816Z x0 = x[:, :D] 2025-05-07T20:32:40.3637039Z x1 = x[:, D:] 2025-05-07T20:32:40.3637255Z 2025-05-07T20:32:40.3637448Z if contiguous: 2025-05-07T20:32:40.3637682Z x0 = x0.contiguous() 2025-05-07T20:32:40.3637942Z x1 = x1.contiguous() 2025-05-07T20:32:40.3638188Z 2025-05-07T20:32:40.3638390Z if scale_ub is not None: 2025-05-07T20:32:40.3638665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.3639004Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.3639326Z ) 2025-05-07T20:32:40.3639525Z else: 2025-05-07T20:32:40.3639737Z scale_ub_tensor = None 2025-05-07T20:32:40.3639994Z 2025-05-07T20:32:40.3640228Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.3640549Z op = silu_mul_quant 2025-05-07T20:32:40.3640802Z if compiled: 2025-05-07T20:32:40.3641055Z op = torch.compile(op) 2025-05-07T20:32:40.3641353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3641625Z 2025-05-07T20:32:40.3641821Z y_fp8, y_scale = fn() 2025-05-07T20:32:40.3642112Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:40.3642399Z 2025-05-07T20:32:40.3642640Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.3642979Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:40.3643267Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:40.3643581Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:40.3643942Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.3644256Z 2025-05-07T20:32:40.3644455Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:40.3644654Z 2025-05-07T20:32:40.3644755Z moe/activation_test.py:126: 2025-05-07T20:32:40.3645060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3645394Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.3645722Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.3646512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.3647260Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.3647806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.3648488Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.3649236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.3649952Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.3650704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:40.3651450Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.3652177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.3652851Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.3653487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.3654005Z fn() 2025-05-07T20:32:40.3654581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.3655160Z self.fn.run( 2025-05-07T20:32:40.3655631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.3656165Z kernel = self.compile( 2025-05-07T20:32:40.3656704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.3657368Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3657772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3658006Z 2025-05-07T20:32:40.3658226Z self = 2025-05-07T20:32:40.3659307Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.3660673Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806edd3a0>} 2025-05-07T20:32:40.3662021Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.3663054Z context = 2025-05-07T20:32:40.3663342Z 2025-05-07T20:32:40.3663509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.3664033Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3664512Z module_map=module_map) 2025-05-07T20:32:40.3664888Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3665248Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:40.3665521Z E ^ 2025-05-07T20:32:40.3665994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3666442Z 2025-05-07T20:32:40.3666865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3667376Z 2025-05-07T20:32:40.3667484Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3667900Z self=, 2025-05-07T20:32:40.3668300Z T=1, 2025-05-07T20:32:40.3668488Z D=5120, 2025-05-07T20:32:40.3668695Z scale_ub=1200.0, 2025-05-07T20:32:40.3668930Z contiguous=True, 2025-05-07T20:32:40.3669208Z compiled=True, 2025-05-07T20:32:40.3669420Z ) 2025-05-07T20:32:40.6304225Z self = 2025-05-07T20:32:40.6305184Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.6305485Z 2025-05-07T20:32:40.6305577Z @given( 2025-05-07T20:32:40.6305818Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6306135Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6306447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6306785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6307114Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6307401Z ) 2025-05-07T20:32:40.6307765Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6308269Z def test_silu_mul_quant( 2025-05-07T20:32:40.6308577Z self, 2025-05-07T20:32:40.6308778Z T: int, 2025-05-07T20:32:40.6308982Z D: int, 2025-05-07T20:32:40.6309260Z scale_ub: Optional[float], 2025-05-07T20:32:40.6309540Z contiguous: bool, 2025-05-07T20:32:40.6309838Z compiled: bool, 2025-05-07T20:32:40.6310072Z ) -> None: 2025-05-07T20:32:40.6310300Z torch.manual_seed(2025) 2025-05-07T20:32:40.6310537Z 2025-05-07T20:32:40.6310815Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6311161Z 2025-05-07T20:32:40.6311361Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6311651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6311963Z x = x_sign * x_clamp 2025-05-07T20:32:40.6312207Z x0 = x[:, :D] 2025-05-07T20:32:40.6312426Z x1 = x[:, D:] 2025-05-07T20:32:40.6312642Z 2025-05-07T20:32:40.6312838Z if contiguous: 2025-05-07T20:32:40.6313073Z x0 = x0.contiguous() 2025-05-07T20:32:40.6313343Z x1 = x1.contiguous() 2025-05-07T20:32:40.6313588Z 2025-05-07T20:32:40.6313777Z if scale_ub is not None: 2025-05-07T20:32:40.6314056Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6314399Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6314704Z ) 2025-05-07T20:32:40.6314905Z else: 2025-05-07T20:32:40.6315126Z scale_ub_tensor = None 2025-05-07T20:32:40.6315376Z 2025-05-07T20:32:40.6315648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6315983Z op = silu_mul_quant 2025-05-07T20:32:40.6316237Z if compiled: 2025-05-07T20:32:40.6316484Z op = torch.compile(op) 2025-05-07T20:32:40.6316786Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6317067Z 2025-05-07T20:32:40.6317257Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6317427Z 2025-05-07T20:32:40.6317531Z moe/activation_test.py:117: 2025-05-07T20:32:40.6317835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6318165Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6318452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6319017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6319575Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6320230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6320922Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6321460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6322135Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6322803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6323339Z kernel = self.compile( 2025-05-07T20:32:40.6323887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6324592Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6324996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6325225Z 2025-05-07T20:32:40.6325442Z self = 2025-05-07T20:32:40.6326528Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6327886Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806db0900>} 2025-05-07T20:32:40.6329533Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6330563Z context = 2025-05-07T20:32:40.6330853Z 2025-05-07T20:32:40.6331032Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6331543Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6332009Z module_map=module_map) 2025-05-07T20:32:40.6332373Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6332731Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6332987Z E ^ 2025-05-07T20:32:40.6333452Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6333903Z 2025-05-07T20:32:40.6334321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6334830Z 2025-05-07T20:32:40.6334953Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6335411Z self=, 2025-05-07T20:32:40.6335815Z T=1, 2025-05-07T20:32:40.6336002Z D=5120, 2025-05-07T20:32:40.6336192Z scale_ub=None, 2025-05-07T20:32:40.6336418Z contiguous=False, 2025-05-07T20:32:40.6336651Z compiled=True, 2025-05-07T20:32:40.6336852Z ) 2025-05-07T20:32:40.6813694Z self = 2025-05-07T20:32:40.6814312Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.6814658Z 2025-05-07T20:32:40.6814745Z @given( 2025-05-07T20:32:40.6815022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6815361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6815828Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6816470Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6817129Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6817692Z ) 2025-05-07T20:32:40.6818377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6819263Z def test_silu_mul_quant( 2025-05-07T20:32:40.6819747Z self, 2025-05-07T20:32:40.6820131Z T: int, 2025-05-07T20:32:40.6820528Z D: int, 2025-05-07T20:32:40.6820966Z scale_ub: Optional[float], 2025-05-07T20:32:40.6821493Z contiguous: bool, 2025-05-07T20:32:40.6821975Z compiled: bool, 2025-05-07T20:32:40.6822425Z ) -> None: 2025-05-07T20:32:40.6822852Z torch.manual_seed(2025) 2025-05-07T20:32:40.6823325Z 2025-05-07T20:32:40.6823867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6824545Z 2025-05-07T20:32:40.6824924Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6825386Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6825702Z x = x_sign * x_clamp 2025-05-07T20:32:40.6826282Z x0 = x[:, :D] 2025-05-07T20:32:40.6826510Z x1 = x[:, D:] 2025-05-07T20:32:40.6826728Z 2025-05-07T20:32:40.6826917Z if contiguous: 2025-05-07T20:32:40.6827155Z x0 = x0.contiguous() 2025-05-07T20:32:40.6827414Z x1 = x1.contiguous() 2025-05-07T20:32:40.6827650Z 2025-05-07T20:32:40.6827851Z if scale_ub is not None: 2025-05-07T20:32:40.6828398Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6828735Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6829090Z ) 2025-05-07T20:32:40.6829288Z else: 2025-05-07T20:32:40.6829500Z scale_ub_tensor = None 2025-05-07T20:32:40.6829937Z 2025-05-07T20:32:40.6830179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6830487Z op = silu_mul_quant 2025-05-07T20:32:40.6830743Z if compiled: 2025-05-07T20:32:40.6831076Z op = torch.compile(op) 2025-05-07T20:32:40.6831384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6831677Z 2025-05-07T20:32:40.6831882Z y_fp8, y_scale = fn() 2025-05-07T20:32:40.6832174Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:40.6832465Z 2025-05-07T20:32:40.6832708Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6833048Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:40.6833351Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:40.6833667Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:40.6834033Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.6834352Z 2025-05-07T20:32:40.6834560Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:40.6834764Z 2025-05-07T20:32:40.6834872Z moe/activation_test.py:126: 2025-05-07T20:32:40.6835184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6835521Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.6835859Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.6836659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.6837415Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.6837962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6838660Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6839351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.6840088Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.6840843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:40.6841599Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.6842331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.6842969Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.6843573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.6844099Z fn() 2025-05-07T20:32:40.6844614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.6845197Z self.fn.run( 2025-05-07T20:32:40.6845675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6846212Z kernel = self.compile( 2025-05-07T20:32:40.6846818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6847477Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6847878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6848108Z 2025-05-07T20:32:40.6848325Z self = 2025-05-07T20:32:40.6849401Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6850843Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9806dce0c0>} 2025-05-07T20:32:40.6852273Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6853300Z context = 2025-05-07T20:32:40.6853591Z 2025-05-07T20:32:40.6853771Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6854282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6854763Z module_map=module_map) 2025-05-07T20:32:40.6855140Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6855500Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:40.6855784Z E ^ 2025-05-07T20:32:40.6856270Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6856719Z 2025-05-07T20:32:40.6857153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6857670Z 2025-05-07T20:32:40.6857783Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6858213Z self=, 2025-05-07T20:32:40.6858630Z T=1, 2025-05-07T20:32:40.6858822Z D=5120, 2025-05-07T20:32:40.6859031Z scale_ub=None, 2025-05-07T20:32:40.6859266Z contiguous=True, 2025-05-07T20:32:40.6859498Z compiled=False, 2025-05-07T20:32:40.6859722Z ) 2025-05-07T20:32:40.8010095Z self = 2025-05-07T20:32:40.8010616Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:40.8010918Z 2025-05-07T20:32:40.8011044Z @given( 2025-05-07T20:32:40.8011388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8011839Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8012285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8012660Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8013004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8013310Z ) 2025-05-07T20:32:40.8013661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8014121Z def test_silu_mul_quant( 2025-05-07T20:32:40.8014385Z self, 2025-05-07T20:32:40.8014594Z T: int, 2025-05-07T20:32:40.8014813Z D: int, 2025-05-07T20:32:40.8015049Z scale_ub: Optional[float], 2025-05-07T20:32:40.8015329Z contiguous: bool, 2025-05-07T20:32:40.8015587Z compiled: bool, 2025-05-07T20:32:40.8015855Z ) -> None: 2025-05-07T20:32:40.8016106Z torch.manual_seed(2025) 2025-05-07T20:32:40.8016366Z 2025-05-07T20:32:40.8016654Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8017039Z 2025-05-07T20:32:40.8017252Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8017736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8018071Z x = x_sign * x_clamp 2025-05-07T20:32:40.8018330Z x0 = x[:, :D] 2025-05-07T20:32:40.8018558Z x1 = x[:, D:] 2025-05-07T20:32:40.8018781Z 2025-05-07T20:32:40.8018986Z if contiguous: 2025-05-07T20:32:40.8019234Z x0 = x0.contiguous() 2025-05-07T20:32:40.8019501Z x1 = x1.contiguous() 2025-05-07T20:32:40.8019754Z 2025-05-07T20:32:40.8019962Z if scale_ub is not None: 2025-05-07T20:32:40.8020237Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8020586Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8020993Z ) 2025-05-07T20:32:40.8021276Z else: 2025-05-07T20:32:40.8021503Z scale_ub_tensor = None 2025-05-07T20:32:40.8021763Z 2025-05-07T20:32:40.8022000Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8022407Z op = silu_mul_quant 2025-05-07T20:32:40.8022673Z if compiled: 2025-05-07T20:32:40.8022922Z 
op = torch.compile(op) 2025-05-07T20:32:40.8023231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8023520Z 2025-05-07T20:32:40.8023720Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8023896Z 2025-05-07T20:32:40.8024002Z moe/activation_test.py:117: 2025-05-07T20:32:40.8024309Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8024653Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8024939Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8025694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8026400Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8026937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8027636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8028569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8029162Z kernel = self.compile( 2025-05-07T20:32:40.8029702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8030364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8030776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8031006Z 2025-05-07T20:32:40.8031227Z self = 2025-05-07T20:32:40.8032316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8033702Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806dcf420>} 2025-05-07T20:32:40.8035049Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8036129Z context = 2025-05-07T20:32:40.8036417Z 2025-05-07T20:32:40.8036586Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8037116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8037600Z module_map=module_map) 2025-05-07T20:32:40.8037978Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8038334Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8038681Z E ^ 2025-05-07T20:32:40.8039162Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8039616Z 2025-05-07T20:32:40.8040040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8040567Z 2025-05-07T20:32:40.8040676Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8041103Z self=, 2025-05-07T20:32:40.8041517Z T=128, 2025-05-07T20:32:40.8041711Z D=5120, 2025-05-07T20:32:40.8041924Z scale_ub=None, 2025-05-07T20:32:40.8042220Z contiguous=False, 2025-05-07T20:32:40.8042513Z compiled=True, 2025-05-07T20:32:40.8042740Z ) 2025-05-07T20:32:40.8043077Z self = 2025-05-07T20:32:40.8043636Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.8043916Z 2025-05-07T20:32:40.8044001Z @given( 2025-05-07T20:32:40.8044249Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8044579Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8044893Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8045241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8045583Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8045918Z ) 2025-05-07T20:32:40.8046282Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8046731Z def test_silu_mul_quant( 2025-05-07T20:32:40.8046980Z self, 2025-05-07T20:32:40.8047193Z T: int, 2025-05-07T20:32:40.8047405Z D: int, 2025-05-07T20:32:40.8047629Z scale_ub: Optional[float], 2025-05-07T20:32:40.8047909Z contiguous: bool, 2025-05-07T20:32:40.8048167Z compiled: bool, 2025-05-07T20:32:40.8048392Z ) -> None: 2025-05-07T20:32:40.8048622Z torch.manual_seed(2025) 2025-05-07T20:32:40.8048879Z 2025-05-07T20:32:40.8049165Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8049508Z 2025-05-07T20:32:40.8049716Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8050022Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8050332Z x = x_sign * x_clamp 2025-05-07T20:32:40.8050585Z x0 = x[:, :D] 2025-05-07T20:32:40.8050816Z x1 = x[:, D:] 2025-05-07T20:32:40.8051026Z 2025-05-07T20:32:40.8051226Z if contiguous: 2025-05-07T20:32:40.8051470Z x0 = x0.contiguous() 2025-05-07T20:32:40.8051733Z x1 = x1.contiguous() 2025-05-07T20:32:40.8051990Z 2025-05-07T20:32:40.8052198Z if scale_ub is not None: 2025-05-07T20:32:40.8052476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8052827Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8053153Z ) 2025-05-07T20:32:40.8053363Z else: 2025-05-07T20:32:40.8053575Z scale_ub_tensor = None 2025-05-07T20:32:40.8053833Z 2025-05-07T20:32:40.8054076Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8054388Z op = silu_mul_quant 2025-05-07T20:32:40.8054648Z if compiled: 2025-05-07T20:32:40.8054905Z op = torch.compile(op) 2025-05-07T20:32:40.8055203Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8055491Z 2025-05-07T20:32:40.8055699Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8055863Z 2025-05-07T20:32:40.8055968Z moe/activation_test.py:117: 2025-05-07T20:32:40.8056321Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8056666Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8056957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8057564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.8058128Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.8058786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8059465Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8060004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8060687Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8061352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8061969Z kernel = self.compile( 2025-05-07T20:32:40.8062512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8063225Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8063628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8063866Z 2025-05-07T20:32:40.8064074Z self = 2025-05-07T20:32:40.8065152Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8066520Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806dcf1a0>} 2025-05-07T20:32:40.8067874Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8068896Z context = 2025-05-07T20:32:40.8069278Z 2025-05-07T20:32:40.8069448Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8069970Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8070440Z module_map=module_map) 2025-05-07T20:32:40.8070803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8071162Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8071435Z E ^ 2025-05-07T20:32:40.8071897Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8072361Z 2025-05-07T20:32:40.8072778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8073297Z 2025-05-07T20:32:40.8073406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8073828Z self=, 2025-05-07T20:32:40.8074230Z T=128, 2025-05-07T20:32:40.8074428Z D=7168, 2025-05-07T20:32:40.8074633Z scale_ub=1200.0, 2025-05-07T20:32:40.8074863Z contiguous=False, 2025-05-07T20:32:40.8075109Z compiled=False, 2025-05-07T20:32:40.8075362Z ) 2025-05-07T20:32:40.8950252Z self = 2025-05-07T20:32:40.8951643Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.8952184Z 2025-05-07T20:32:40.8952344Z @given( 2025-05-07T20:32:40.8952812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8953466Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8954078Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8954726Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8955573Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8955870Z ) 2025-05-07T20:32:40.8956219Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8956668Z def test_silu_mul_quant( 2025-05-07T20:32:40.8956919Z self, 2025-05-07T20:32:40.8957117Z T: int, 2025-05-07T20:32:40.8957322Z D: int, 2025-05-07T20:32:40.8957551Z scale_ub: Optional[float], 2025-05-07T20:32:40.8957819Z contiguous: bool, 2025-05-07T20:32:40.8958066Z compiled: bool, 2025-05-07T20:32:40.8958299Z ) -> None: 2025-05-07T20:32:40.8958517Z torch.manual_seed(2025) 2025-05-07T20:32:40.8958765Z 2025-05-07T20:32:40.8959046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8959608Z 2025-05-07T20:32:40.8959807Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8960112Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8960504Z x = x_sign * x_clamp 2025-05-07T20:32:40.8960751Z x0 = x[:, :D] 2025-05-07T20:32:40.8960980Z x1 = x[:, D:] 2025-05-07T20:32:40.8961197Z 2025-05-07T20:32:40.8961395Z if contiguous: 2025-05-07T20:32:40.8961629Z x0 = x0.contiguous() 2025-05-07T20:32:40.8961897Z x1 = x1.contiguous() 2025-05-07T20:32:40.8962143Z 2025-05-07T20:32:40.8962336Z if scale_ub is not None: 2025-05-07T20:32:40.8962617Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8962959Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8963266Z ) 2025-05-07T20:32:40.8963471Z else: 2025-05-07T20:32:40.8963693Z scale_ub_tensor = None 2025-05-07T20:32:40.8963944Z 2025-05-07T20:32:40.8964190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8964510Z op = silu_mul_quant 2025-05-07T20:32:40.8964761Z if compiled: 2025-05-07T20:32:40.8965021Z op = torch.compile(op) 2025-05-07T20:32:40.8965364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8965647Z 2025-05-07T20:32:40.8965850Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8966022Z 2025-05-07T20:32:40.8966125Z moe/activation_test.py:117: 2025-05-07T20:32:40.8966431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8966762Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8967052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8967742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8968429Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8968969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8969663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8970330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8970860Z kernel = self.compile( 2025-05-07T20:32:40.8971407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8972066Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8972459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8972696Z 2025-05-07T20:32:40.8972903Z self = 2025-05-07T20:32:40.8973985Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8975428Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98065307c0>} 2025-05-07T20:32:40.8976816Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8977832Z context = 2025-05-07T20:32:40.8978128Z 2025-05-07T20:32:40.8978296Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8978821Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8979335Z module_map=module_map) 2025-05-07T20:32:40.8979737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8980100Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8980369Z E ^ 2025-05-07T20:32:40.8980873Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8981334Z 2025-05-07T20:32:40.8981750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8982268Z 2025-05-07T20:32:40.8982378Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8982802Z self=, 2025-05-07T20:32:40.8983202Z T=128, 2025-05-07T20:32:40.8983402Z D=5120, 2025-05-07T20:32:40.8983608Z scale_ub=None, 2025-05-07T20:32:40.8983828Z contiguous=False, 2025-05-07T20:32:40.8984063Z compiled=False, 2025-05-07T20:32:40.8984283Z ) 2025-05-07T20:32:40.8984602Z self = 2025-05-07T20:32:40.8985102Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:40.8985379Z 2025-05-07T20:32:40.8985466Z @given( 2025-05-07T20:32:40.8985710Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8986023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8986339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8986676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8987001Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8987288Z ) 2025-05-07T20:32:40.8987644Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8988082Z def test_silu_mul_quant( 2025-05-07T20:32:40.8988329Z self, 2025-05-07T20:32:40.8988536Z T: int, 2025-05-07T20:32:40.8988734Z D: int, 2025-05-07T20:32:40.8988963Z scale_ub: Optional[float], 2025-05-07T20:32:40.8989345Z contiguous: bool, 2025-05-07T20:32:40.8989592Z compiled: bool, 2025-05-07T20:32:40.8989814Z ) -> None: 2025-05-07T20:32:40.8990039Z torch.manual_seed(2025) 2025-05-07T20:32:40.8990290Z 2025-05-07T20:32:40.8990563Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8990916Z 2025-05-07T20:32:40.8991121Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8991410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8991727Z x = x_sign * x_clamp 2025-05-07T20:32:40.8991980Z x0 = x[:, :D] 2025-05-07T20:32:40.8992199Z x1 = x[:, D:] 2025-05-07T20:32:40.8992414Z 2025-05-07T20:32:40.8992605Z if contiguous: 2025-05-07T20:32:40.8992834Z x0 = x0.contiguous() 2025-05-07T20:32:40.8993094Z x1 = x1.contiguous() 2025-05-07T20:32:40.8993336Z 2025-05-07T20:32:40.8993536Z if scale_ub is not None: 2025-05-07T20:32:40.8993816Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8994156Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8994463Z ) 2025-05-07T20:32:40.8994663Z else: 2025-05-07T20:32:40.8994936Z scale_ub_tensor = None 2025-05-07T20:32:40.8995217Z 2025-05-07T20:32:40.8995476Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8995798Z op = silu_mul_quant 2025-05-07T20:32:40.8996064Z if compiled: 2025-05-07T20:32:40.8996311Z op = torch.compile(op) 2025-05-07T20:32:40.8996617Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8996896Z 2025-05-07T20:32:40.8997096Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8997267Z 2025-05-07T20:32:40.8997367Z moe/activation_test.py:117: 2025-05-07T20:32:40.8997670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8998046Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8998369Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8999101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8999803Z 
_fbgemm_silu_mul_quant[grid](
    [... identical Triton JIT/compile traceback as above; omitted ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

The same CompilationError is raised for each of the following Hypothesis examples; the duplicate test-body printouts and tracebacks are omitted:

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
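Root cause of these failures: Triton's fp8e4nv is the e4m3 FP8 variant (the type PyTorch exposes as torch.float8_e4m3fn), and Triton's NVIDIA backend can only generate it on GPUs with compute capability 8.9 or newer (Ada/Hopper); on older architectures only fp8e4b15 and fp8e5 are available, which is exactly the list the ValueError prints. Below is a minimal sketch of a skip-guard for such hardware; the helper name, the decorator, and the (8, 9) threshold are assumptions inferred from the error message, not part of the test file.

    # Hypothetical skip-guard, not part of moe/activation_test.py; assumes
    # fp8e4nv codegen requires NVIDIA compute capability >= (8, 9).
    import unittest

    import torch


    def _supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # (major, minor) of the current CUDA device, e.g. (8, 6) for an A10G.
        return torch.cuda.get_device_capability() >= (8, 9)


    requires_fp8e4nv = unittest.skipIf(
        not _supports_fp8e4nv(),
        "fp8e4nv requires SM 8.9+ (Ada/Hopper); this GPU only supports fp8e4b15/fp8e5",
    )

Applied as @requires_fp8e4nv on test_silu_mul_quant, a guard like this would report the examples above as skips instead of errors on pre-Ada runners.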
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[duplicate test-body printout omitted; for this example fn() itself returns and the failure moves to the reference path]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    [... autotuner benchmarking and Triton JIT/compile frames omitted ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
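For clarity, the quantization ref_fn asks for can be written in plain PyTorch. The sketch below is an illustrative stand-in for triton_quantize_fp8_row, assuming per-row max-abs scaling against the e4m3fn maximum with an optional scale_ub clamp (inferred from the call site and from the y_fp8.to(torch.float32) * y_scale[:, None] dequantization above); it is not FBGEMM's actual implementation, which JIT-compiles a Triton kernel (_kernel_quantize_fp8_row) and therefore hits the same architecture check even though the op under test was never reached.

    # Illustrative pure-PyTorch row-wise FP8 quantization; an assumed stand-in
    # for triton_quantize_fp8_row, not FBGEMM's implementation.
    from typing import Optional, Tuple

    import torch

    E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0


    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs scale, optionally clamped to an upper bound.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / E4M3_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize with y_fp8.to(torch.float32) * scale[:, None].
        return y_fp8, scale

A fallback like this keeps the reference math testable on hardware where Triton's FP8 codegen is unavailable, at the cost of exact parity with the Triton kernel's rounding.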
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)

The same CompilationError is raised for this and each of the following Hypothesis examples; the duplicate test-body printouts and tracebacks are omitted:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False), which fails with the same
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.8329031Z 2025-05-07T20:32:41.8329453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.8329969Z 2025-05-07T20:32:41.8330081Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.8330497Z self=, 2025-05-07T20:32:41.8330909Z T=16384, 2025-05-07T20:32:41.8331111Z D=7168, 2025-05-07T20:32:41.8331301Z scale_ub=None, 2025-05-07T20:32:41.8331520Z contiguous=True, 2025-05-07T20:32:41.8331746Z compiled=True, 2025-05-07T20:32:41.8331943Z ) 2025-05-07T20:32:41.9691662Z self = 2025-05-07T20:32:41.9692230Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:41.9692513Z 2025-05-07T20:32:41.9692594Z @given( 2025-05-07T20:32:41.9692825Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9693134Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9693441Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9693777Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9694107Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9694392Z ) 2025-05-07T20:32:41.9694742Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9695181Z def test_silu_mul_quant( 2025-05-07T20:32:41.9695421Z self, 2025-05-07T20:32:41.9695647Z T: int, 2025-05-07T20:32:41.9695864Z D: int, 2025-05-07T20:32:41.9696089Z scale_ub: Optional[float], 2025-05-07T20:32:41.9696524Z contiguous: bool, 2025-05-07T20:32:41.9696768Z compiled: bool, 2025-05-07T20:32:41.9696985Z ) -> None: 2025-05-07T20:32:41.9697204Z torch.manual_seed(2025) 2025-05-07T20:32:41.9697444Z 2025-05-07T20:32:41.9697710Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9698048Z 2025-05-07T20:32:41.9698239Z x_sign = torch.sign(x) 2025-05-07T20:32:41.9698525Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.9698833Z x = x_sign * x_clamp 2025-05-07T20:32:41.9699076Z x0 = x[:, :D] 2025-05-07T20:32:41.9699291Z x1 = x[:, D:] 2025-05-07T20:32:41.9699505Z 2025-05-07T20:32:41.9699690Z if contiguous: 2025-05-07T20:32:41.9700035Z x0 = x0.contiguous() 2025-05-07T20:32:41.9700291Z x1 = x1.contiguous() 2025-05-07T20:32:41.9700531Z 2025-05-07T20:32:41.9700715Z if scale_ub is not None: 2025-05-07T20:32:41.9700997Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.9701335Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.9701641Z ) 2025-05-07T20:32:41.9701828Z else: 2025-05-07T20:32:41.9702039Z scale_ub_tensor = None 2025-05-07T20:32:41.9702294Z 2025-05-07T20:32:41.9702527Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.9702841Z op = silu_mul_quant 2025-05-07T20:32:41.9703167Z if compiled: 2025-05-07T20:32:41.9703474Z op = torch.compile(op) 2025-05-07T20:32:41.9703778Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.9704058Z 2025-05-07T20:32:41.9704254Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.9704478Z 2025-05-07T20:32:41.9704591Z moe/activation_test.py:117: 2025-05-07T20:32:41.9704889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.9705216Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.9705506Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.9706119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.9706680Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.9707336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.9708019Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.9708560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.9709289Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.9709948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.9710476Z kernel = self.compile( 2025-05-07T20:32:41.9711022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.9711669Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.9712078Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.9712306Z 2025-05-07T20:32:41.9712516Z self = 2025-05-07T20:32:41.9713585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.9714953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806ae1120>} 2025-05-07T20:32:41.9716331Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.9717353Z context = 2025-05-07T20:32:41.9717643Z 2025-05-07T20:32:41.9717809Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.9718328Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.9718788Z module_map=module_map) 2025-05-07T20:32:41.9719155Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.9719507Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.9719769Z E ^ 2025-05-07T20:32:41.9720278Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.9720728Z 2025-05-07T20:32:41.9721144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.9721653Z 2025-05-07T20:32:41.9721761Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9722163Z self=, 2025-05-07T20:32:41.9722560Z T=4096, 2025-05-07T20:32:41.9722750Z D=5120, 2025-05-07T20:32:41.9722944Z scale_ub=None, 2025-05-07T20:32:41.9723157Z contiguous=False, 2025-05-07T20:32:41.9723381Z compiled=True, 2025-05-07T20:32:41.9723629Z ) 2025-05-07T20:32:41.9723945Z self = 2025-05-07T20:32:41.9724476Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:41.9724745Z 2025-05-07T20:32:41.9724830Z @given( 2025-05-07T20:32:41.9725099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9725415Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9725721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9726042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9726365Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9726651Z ) 2025-05-07T20:32:41.9726996Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9727425Z def test_silu_mul_quant( 2025-05-07T20:32:41.9727691Z self, 2025-05-07T20:32:41.9727886Z T: int, 2025-05-07T20:32:41.9728084Z D: int, 2025-05-07T20:32:41.9728459Z scale_ub: Optional[float], 2025-05-07T20:32:41.9728731Z contiguous: bool, 2025-05-07T20:32:41.9728969Z compiled: bool, 2025-05-07T20:32:41.9729191Z ) -> None: 2025-05-07T20:32:41.9729413Z torch.manual_seed(2025) 2025-05-07T20:32:41.9729654Z 2025-05-07T20:32:41.9729922Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9730268Z 2025-05-07T20:32:41.9730458Z x_sign = torch.sign(x) 2025-05-07T20:32:41.9730750Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.9731058Z x = x_sign * x_clamp 2025-05-07T20:32:41.9731302Z x0 = x[:, :D] 2025-05-07T20:32:41.9731527Z x1 = x[:, D:] 2025-05-07T20:32:41.9731741Z 2025-05-07T20:32:41.9731931Z if contiguous: 2025-05-07T20:32:41.9732165Z x0 = x0.contiguous() 2025-05-07T20:32:41.9732417Z x1 = x1.contiguous() 2025-05-07T20:32:41.9732659Z 2025-05-07T20:32:41.9732865Z if scale_ub is not None: 2025-05-07T20:32:41.9733146Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.9733484Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.9733794Z ) 2025-05-07T20:32:41.9733987Z else: 2025-05-07T20:32:41.9734197Z scale_ub_tensor = None 2025-05-07T20:32:41.9734446Z 2025-05-07T20:32:41.9734679Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.9734997Z op = silu_mul_quant 2025-05-07T20:32:41.9735249Z if compiled: 2025-05-07T20:32:41.9735493Z op = torch.compile(op) 2025-05-07T20:32:41.9735789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.9736071Z 2025-05-07T20:32:41.9736258Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.9736425Z 2025-05-07T20:32:41.9736522Z moe/activation_test.py:117: 2025-05-07T20:32:41.9736819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.9737155Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.9737431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.9737993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.9738557Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.9739287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.9739967Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.9740495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.9741171Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.9741824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.9742347Z kernel = self.compile( 2025-05-07T20:32:41.9742945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.9743648Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.9744091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.9744328Z 2025-05-07T20:32:41.9744532Z self = 2025-05-07T20:32:41.9745641Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.9747015Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806ae1c60>} 2025-05-07T20:32:41.9748349Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.9749409Z context = 2025-05-07T20:32:41.9749695Z 2025-05-07T20:32:41.9749867Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.9750386Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.9750850Z module_map=module_map) 2025-05-07T20:32:41.9751208Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.9751555Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.9751814Z E ^ 2025-05-07T20:32:41.9752266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.9752717Z 2025-05-07T20:32:41.9753135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.9753654Z 2025-05-07T20:32:42.0898834Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0899321Z self=, 2025-05-07T20:32:42.0899740Z T=4096, 2025-05-07T20:32:42.0899930Z D=5120, 2025-05-07T20:32:42.0900119Z scale_ub=1200.0, 2025-05-07T20:32:42.0900345Z contiguous=False, 2025-05-07T20:32:42.0900568Z compiled=False, 2025-05-07T20:32:42.0900773Z ) 2025-05-07T20:32:42.0901093Z self = 2025-05-07T20:32:42.0901582Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.0901859Z 2025-05-07T20:32:42.0901938Z @given( 2025-05-07T20:32:42.0902174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0902481Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0902792Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0903126Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0903453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0903729Z ) 2025-05-07T20:32:42.0904213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0904656Z def test_silu_mul_quant( 2025-05-07T20:32:42.0904891Z self, 2025-05-07T20:32:42.0905084Z T: int, 2025-05-07T20:32:42.0905283Z D: int, 2025-05-07T20:32:42.0905497Z scale_ub: Optional[float], 2025-05-07T20:32:42.0905773Z contiguous: bool, 2025-05-07T20:32:42.0906011Z compiled: bool, 2025-05-07T20:32:42.0906225Z ) -> None: 2025-05-07T20:32:42.0906441Z torch.manual_seed(2025) 2025-05-07T20:32:42.0906678Z 2025-05-07T20:32:42.0906943Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0907285Z 2025-05-07T20:32:42.0907480Z x_sign = torch.sign(x) 2025-05-07T20:32:42.0907888Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.0908191Z x = x_sign * x_clamp 2025-05-07T20:32:42.0908433Z x0 = x[:, :D] 2025-05-07T20:32:42.0908657Z x1 = x[:, D:] 2025-05-07T20:32:42.0908913Z 2025-05-07T20:32:42.0909153Z if contiguous: 2025-05-07T20:32:42.0909386Z x0 = x0.contiguous() 2025-05-07T20:32:42.0909643Z x1 = x1.contiguous() 2025-05-07T20:32:42.0909883Z 2025-05-07T20:32:42.0910074Z if scale_ub is not None: 2025-05-07T20:32:42.0910338Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.0910665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.0910972Z ) 2025-05-07T20:32:42.0911160Z else: 2025-05-07T20:32:42.0911368Z scale_ub_tensor = None 2025-05-07T20:32:42.0911616Z 2025-05-07T20:32:42.0911843Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.0912162Z op = silu_mul_quant 2025-05-07T20:32:42.0912416Z if compiled: 2025-05-07T20:32:42.0912657Z op = torch.compile(op) 2025-05-07T20:32:42.0912951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.0913223Z 2025-05-07T20:32:42.0913417Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.0913614Z 2025-05-07T20:32:42.0913723Z moe/activation_test.py:117: 2025-05-07T20:32:42.0914019Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0914351Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.0914628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.0915306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.0915992Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.0916525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.0917206Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.0917858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.0918388Z kernel = self.compile( 2025-05-07T20:32:42.0918924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.0919567Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.0919961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0920192Z 2025-05-07T20:32:42.0920395Z self = 2025-05-07T20:32:42.0921468Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.0923018Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806ae3240>} 2025-05-07T20:32:42.0924390Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.0925403Z context = 2025-05-07T20:32:42.0925689Z 2025-05-07T20:32:42.0925858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.0926367Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.0926828Z module_map=module_map) 2025-05-07T20:32:42.0927190Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.0927579Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.0927872Z E ^ 2025-05-07T20:32:42.0928484Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.0928930Z 2025-05-07T20:32:42.0929418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.0929924Z 2025-05-07T20:32:42.0930033Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0930439Z self=, 2025-05-07T20:32:42.0930837Z T=4096, 2025-05-07T20:32:42.0931023Z D=5120, 2025-05-07T20:32:42.0931213Z scale_ub=1200.0, 2025-05-07T20:32:42.0931436Z contiguous=False, 2025-05-07T20:32:42.0931660Z compiled=True, 2025-05-07T20:32:42.0931858Z ) 2025-05-07T20:32:42.0932175Z self = 2025-05-07T20:32:42.0932664Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.0932936Z 2025-05-07T20:32:42.0933020Z @given( 2025-05-07T20:32:42.0933246Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0933560Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0933867Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0934190Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0934513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0934793Z ) 2025-05-07T20:32:42.0935133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0935582Z def test_silu_mul_quant( 2025-05-07T20:32:42.0935862Z self, 2025-05-07T20:32:42.0936052Z T: int, 2025-05-07T20:32:42.0936252Z D: int, 2025-05-07T20:32:42.0936474Z scale_ub: Optional[float], 2025-05-07T20:32:42.0936739Z contiguous: bool, 2025-05-07T20:32:42.0936975Z compiled: bool, 2025-05-07T20:32:42.0937195Z ) -> None: 2025-05-07T20:32:42.0937412Z torch.manual_seed(2025) 2025-05-07T20:32:42.0937645Z 2025-05-07T20:32:42.0937916Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0938255Z 2025-05-07T20:32:42.0938446Z x_sign = torch.sign(x) 2025-05-07T20:32:42.0938732Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.0939035Z x = x_sign * x_clamp 2025-05-07T20:32:42.0939270Z x0 = x[:, :D] 2025-05-07T20:32:42.0939487Z x1 = x[:, D:] 2025-05-07T20:32:42.0939697Z 2025-05-07T20:32:42.0939878Z if contiguous: 2025-05-07T20:32:42.0940108Z x0 = x0.contiguous() 2025-05-07T20:32:42.0940361Z x1 = x1.contiguous() 2025-05-07T20:32:42.0940590Z 2025-05-07T20:32:42.0940780Z if scale_ub is not None: 2025-05-07T20:32:42.0941045Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.0941373Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.0941681Z ) 2025-05-07T20:32:42.0941876Z else: 2025-05-07T20:32:42.0942084Z scale_ub_tensor = None 2025-05-07T20:32:42.0942326Z 2025-05-07T20:32:42.0942554Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.0942938Z op = silu_mul_quant 2025-05-07T20:32:42.0943185Z if compiled: 2025-05-07T20:32:42.0943428Z op = torch.compile(op) 2025-05-07T20:32:42.0943722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.0943987Z 2025-05-07T20:32:42.0944178Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.0944340Z 2025-05-07T20:32:42.0944445Z moe/activation_test.py:117: 2025-05-07T20:32:42.0944732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0945063Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.0945341Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.0946007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.0946615Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.0947301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.0947980Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.0948504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.0949240Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.0949898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.0950425Z kernel = self.compile( 2025-05-07T20:32:42.0950955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.0951603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.0951999Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0952224Z 2025-05-07T20:32:42.0952433Z self = 2025-05-07T20:32:42.0953505Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.0954855Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664c720>} 2025-05-07T20:32:42.0956179Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.0957195Z context = 2025-05-07T20:32:42.0957476Z 2025-05-07T20:32:42.0957644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.0958159Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.0958618Z module_map=module_map) 2025-05-07T20:32:42.0958983Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.0959329Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.0959586Z E ^ 2025-05-07T20:32:42.0960050Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.0960495Z 2025-05-07T20:32:42.0960907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.0961417Z 2025-05-07T20:32:42.1841800Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1850035Z self=, 2025-05-07T20:32:42.1850449Z T=2048, 2025-05-07T20:32:42.1850637Z D=7168, 2025-05-07T20:32:42.1850838Z scale_ub=1200.0, 2025-05-07T20:32:42.1851178Z contiguous=False, 2025-05-07T20:32:42.1851405Z compiled=False, 2025-05-07T20:32:42.1851609Z ) 2025-05-07T20:32:42.1851925Z self = 2025-05-07T20:32:42.1852412Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.1852693Z 2025-05-07T20:32:42.1852770Z @given( 2025-05-07T20:32:42.1853001Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1853310Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1853614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1853948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1854340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1854680Z ) 2025-05-07T20:32:42.1855033Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1855484Z def test_silu_mul_quant( 2025-05-07T20:32:42.1855855Z self, 2025-05-07T20:32:42.1856064Z T: int, 2025-05-07T20:32:42.1856266Z D: int, 2025-05-07T20:32:42.1856484Z scale_ub: Optional[float], 2025-05-07T20:32:42.1856750Z contiguous: bool, 2025-05-07T20:32:42.1856992Z compiled: bool, 2025-05-07T20:32:42.1857211Z ) -> None: 2025-05-07T20:32:42.1857429Z torch.manual_seed(2025) 2025-05-07T20:32:42.1857666Z 2025-05-07T20:32:42.1857936Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1858280Z 2025-05-07T20:32:42.1858483Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1858773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1859089Z x = x_sign * x_clamp 2025-05-07T20:32:42.1859335Z x0 = x[:, :D] 2025-05-07T20:32:42.1859551Z x1 = x[:, D:] 2025-05-07T20:32:42.1859760Z 2025-05-07T20:32:42.1859942Z if contiguous: 2025-05-07T20:32:42.1860175Z x0 = x0.contiguous() 2025-05-07T20:32:42.1860431Z x1 = x1.contiguous() 2025-05-07T20:32:42.1860668Z 2025-05-07T20:32:42.1860859Z if scale_ub is not None: 2025-05-07T20:32:42.1861122Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1861455Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1861758Z ) 2025-05-07T20:32:42.1861944Z else: 2025-05-07T20:32:42.1862159Z scale_ub_tensor = None 2025-05-07T20:32:42.1862407Z 2025-05-07T20:32:42.1862631Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1862948Z op = silu_mul_quant 2025-05-07T20:32:42.1863197Z if compiled: 2025-05-07T20:32:42.1863440Z op = torch.compile(op) 2025-05-07T20:32:42.1863733Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1864012Z 2025-05-07T20:32:42.1864198Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1864366Z 2025-05-07T20:32:42.1864468Z moe/activation_test.py:117: 2025-05-07T20:32:42.1864781Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1865134Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1865422Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1866170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.1866851Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1867372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1868048Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1868717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1869314Z kernel = self.compile( 2025-05-07T20:32:42.1869903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1870553Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1870954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1871178Z 2025-05-07T20:32:42.1871383Z self = 2025-05-07T20:32:42.1872450Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1873804Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664d580>} 2025-05-07T20:32:42.1875250Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1876265Z context = 2025-05-07T20:32:42.1876548Z 2025-05-07T20:32:42.1876713Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1877223Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1877691Z module_map=module_map) 2025-05-07T20:32:42.1878054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1878398Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1878651Z E ^ 2025-05-07T20:32:42.1879110Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1879559Z 2025-05-07T20:32:42.1879974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1880484Z 2025-05-07T20:32:42.1880586Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1881027Z self=, 2025-05-07T20:32:42.1881427Z T=1, 2025-05-07T20:32:42.1881614Z D=7168, 2025-05-07T20:32:42.1881799Z scale_ub=None, 2025-05-07T20:32:42.1882014Z contiguous=True, 2025-05-07T20:32:42.1882237Z compiled=False, 2025-05-07T20:32:42.1882438Z ) 2025-05-07T20:32:42.1882756Z self = 2025-05-07T20:32:42.1883241Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.1883496Z 2025-05-07T20:32:42.1883578Z @given( 2025-05-07T20:32:42.1883812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1884127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1884431Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1884757Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1885085Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1885372Z ) 2025-05-07T20:32:42.1885719Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1886158Z def test_silu_mul_quant( 2025-05-07T20:32:42.1886404Z self, 2025-05-07T20:32:42.1886593Z T: int, 2025-05-07T20:32:42.1886792Z D: int, 2025-05-07T20:32:42.1887010Z scale_ub: Optional[float], 2025-05-07T20:32:42.1887271Z contiguous: bool, 2025-05-07T20:32:42.1887516Z compiled: bool, 2025-05-07T20:32:42.1887741Z ) -> None: 2025-05-07T20:32:42.1887951Z torch.manual_seed(2025) 2025-05-07T20:32:42.1888197Z 2025-05-07T20:32:42.1888469Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1888809Z 2025-05-07T20:32:42.1889001Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1889291Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1889644Z x = x_sign * x_clamp 2025-05-07T20:32:42.1889878Z x0 = x[:, :D] 2025-05-07T20:32:42.1890096Z x1 = x[:, D:] 2025-05-07T20:32:42.1890299Z 2025-05-07T20:32:42.1890482Z if contiguous: 2025-05-07T20:32:42.1890712Z x0 = x0.contiguous() 2025-05-07T20:32:42.1890966Z x1 = x1.contiguous() 2025-05-07T20:32:42.1891194Z 2025-05-07T20:32:42.1891399Z if scale_ub is not None: 2025-05-07T20:32:42.1891675Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1892013Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1892316Z ) 2025-05-07T20:32:42.1892558Z else: 2025-05-07T20:32:42.1892768Z scale_ub_tensor = None 2025-05-07T20:32:42.1893047Z 2025-05-07T20:32:42.1893281Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1893591Z op = silu_mul_quant 2025-05-07T20:32:42.1893874Z if compiled: 2025-05-07T20:32:42.1894124Z op = torch.compile(op) 2025-05-07T20:32:42.1894413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1894678Z 2025-05-07T20:32:42.1894869Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1895037Z 2025-05-07T20:32:42.1895133Z moe/activation_test.py:117: 2025-05-07T20:32:42.1895433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1895758Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1896059Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1896774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1897457Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1897990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1898670Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1899322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1899844Z kernel = self.compile( 2025-05-07T20:32:42.1900376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1901029Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1901420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1901649Z 2025-05-07T20:32:42.1901851Z self = 2025-05-07T20:32:42.1902918Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1904284Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664cea0>} 2025-05-07T20:32:42.1905608Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1906668Z context = 2025-05-07T20:32:42.1906958Z 2025-05-07T20:32:42.1907121Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1907633Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1908103Z module_map=module_map) 2025-05-07T20:32:42.1908468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1908816Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1909119Z E ^ 2025-05-07T20:32:42.1909628Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1910083Z 2025-05-07T20:32:42.1910497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1911005Z 2025-05-07T20:32:42.1911107Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1911518Z self=, 2025-05-07T20:32:42.1911914Z T=16384, 2025-05-07T20:32:42.1912110Z D=7168, 2025-05-07T20:32:42.1912309Z scale_ub=1200.0, 2025-05-07T20:32:42.1912529Z contiguous=False, 2025-05-07T20:32:42.1912798Z compiled=True, 2025-05-07T20:32:42.5375502Z ) 2025-05-07T20:32:42.5377063Z self = 2025-05-07T20:32:42.5378207Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.5378617Z 2025-05-07T20:32:42.5378742Z @given( 2025-05-07T20:32:42.5379074Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5379489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5379900Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5380321Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5380677Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5380978Z ) 2025-05-07T20:32:42.5381329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5381782Z def test_silu_mul_quant( 2025-05-07T20:32:42.5382035Z self, 2025-05-07T20:32:42.5382240Z T: int, 2025-05-07T20:32:42.5382458Z D: int, 2025-05-07T20:32:42.5382694Z scale_ub: Optional[float], 2025-05-07T20:32:42.5382967Z contiguous: bool, 2025-05-07T20:32:42.5383218Z compiled: bool, 2025-05-07T20:32:42.5383460Z ) -> None: 2025-05-07T20:32:42.5383681Z torch.manual_seed(2025) 2025-05-07T20:32:42.5383933Z 2025-05-07T20:32:42.5384220Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5384575Z 2025-05-07T20:32:42.5384780Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5385085Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5385405Z x = x_sign * x_clamp 2025-05-07T20:32:42.5385644Z x0 = x[:, :D] 2025-05-07T20:32:42.5385867Z x1 = x[:, D:] 2025-05-07T20:32:42.5386085Z 2025-05-07T20:32:42.5386273Z if contiguous: 2025-05-07T20:32:42.5386512Z x0 = x0.contiguous() 2025-05-07T20:32:42.5386779Z x1 = x1.contiguous() 2025-05-07T20:32:42.5387019Z 2025-05-07T20:32:42.5387228Z if scale_ub is not None: 2025-05-07T20:32:42.5387517Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5387859Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5388181Z ) 2025-05-07T20:32:42.5388387Z else: 2025-05-07T20:32:42.5388600Z scale_ub_tensor = None 2025-05-07T20:32:42.5388860Z 2025-05-07T20:32:42.5389202Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5389520Z op = silu_mul_quant 2025-05-07T20:32:42.5389783Z if compiled: 2025-05-07T20:32:42.5390043Z op = torch.compile(op) 2025-05-07T20:32:42.5390346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5390620Z 2025-05-07T20:32:42.5390824Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5390989Z 2025-05-07T20:32:42.5391102Z moe/activation_test.py:117: 2025-05-07T20:32:42.5391396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5391748Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5392041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5392712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5393286Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.5393951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5394640Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5395174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5395888Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5396583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5397212Z kernel = self.compile( 2025-05-07T20:32:42.5397837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5398538Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5398949Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5399182Z 2025-05-07T20:32:42.5399390Z self = 2025-05-07T20:32:42.5400479Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5401864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664f9c0>} 2025-05-07T20:32:42.5403212Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5404246Z context = 2025-05-07T20:32:42.5404533Z 2025-05-07T20:32:42.5404704Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5405230Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5405710Z module_map=module_map) 2025-05-07T20:32:42.5406075Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5406444Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5406760Z E ^ 2025-05-07T20:32:42.5407235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5407688Z 2025-05-07T20:32:42.5408111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5408630Z 2025-05-07T20:32:42.5408738Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5409166Z self=, 2025-05-07T20:32:42.5409574Z T=1, 2025-05-07T20:32:42.5409761Z D=7168, 2025-05-07T20:32:42.5409970Z scale_ub=None, 2025-05-07T20:32:42.5410199Z contiguous=False, 2025-05-07T20:32:42.5410428Z compiled=False, 2025-05-07T20:32:42.5410645Z ) 2025-05-07T20:32:42.5410976Z self = 2025-05-07T20:32:42.5411460Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.5411730Z 2025-05-07T20:32:42.5411813Z @given( 2025-05-07T20:32:42.5412055Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5412374Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5412692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5413031Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5413366Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5413651Z ) 2025-05-07T20:32:42.5414063Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5414515Z def test_silu_mul_quant( 2025-05-07T20:32:42.5414760Z self, 2025-05-07T20:32:42.5414966Z T: int, 2025-05-07T20:32:42.5415172Z D: int, 2025-05-07T20:32:42.5415393Z scale_ub: Optional[float], 2025-05-07T20:32:42.5415673Z contiguous: bool, 2025-05-07T20:32:42.5415923Z compiled: bool, 2025-05-07T20:32:42.5416149Z ) -> None: 2025-05-07T20:32:42.5416373Z torch.manual_seed(2025) 2025-05-07T20:32:42.5416629Z 2025-05-07T20:32:42.5416902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5417300Z 2025-05-07T20:32:42.5417548Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5417842Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5418159Z x = x_sign * x_clamp 2025-05-07T20:32:42.5418451Z x0 = x[:, :D] 2025-05-07T20:32:42.5418683Z x1 = x[:, D:] 2025-05-07T20:32:42.5418893Z 2025-05-07T20:32:42.5419093Z if contiguous: 2025-05-07T20:32:42.5419335Z x0 = x0.contiguous() 2025-05-07T20:32:42.5419595Z x1 = x1.contiguous() 2025-05-07T20:32:42.5419842Z 2025-05-07T20:32:42.5420047Z if scale_ub is not None: 2025-05-07T20:32:42.5420320Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5420660Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5420980Z ) 2025-05-07T20:32:42.5421174Z else: 2025-05-07T20:32:42.5421394Z scale_ub_tensor = None 2025-05-07T20:32:42.5421652Z 2025-05-07T20:32:42.5421885Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5422210Z op = silu_mul_quant 2025-05-07T20:32:42.5422469Z if compiled: 2025-05-07T20:32:42.5422717Z op = torch.compile(op) 2025-05-07T20:32:42.5423023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5423307Z 2025-05-07T20:32:42.5423507Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5423674Z 2025-05-07T20:32:42.5423796Z moe/activation_test.py:117: 2025-05-07T20:32:42.5424103Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5424444Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5424726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5425416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5426115Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5426705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5427394Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5428065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5428913Z kernel = self.compile( 2025-05-07T20:32:42.5429495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5430151Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5430558Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5430789Z 2025-05-07T20:32:42.5431006Z self = 2025-05-07T20:32:42.5432090Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5433532Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655c5c860>} 2025-05-07T20:32:42.5434880Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5435912Z context = 2025-05-07T20:32:42.5436199Z 2025-05-07T20:32:42.5436379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5436940Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5437415Z module_map=module_map) 2025-05-07T20:32:42.5437888Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5438305Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5438570Z E ^ 2025-05-07T20:32:42.5439107Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5439562Z 2025-05-07T20:32:42.5439991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5440502Z 2025-05-07T20:32:42.5440609Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5441030Z self=, 2025-05-07T20:32:42.5441444Z T=2048, 2025-05-07T20:32:42.5441648Z D=7168, 2025-05-07T20:32:42.5441845Z scale_ub=None, 2025-05-07T20:32:42.5442072Z contiguous=False, 2025-05-07T20:32:42.5442308Z compiled=True, 2025-05-07T20:32:42.5442517Z ) 2025-05-07T20:32:42.6129002Z self = 2025-05-07T20:32:42.6130433Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6131175Z 2025-05-07T20:32:42.6131401Z @given( 2025-05-07T20:32:42.6131879Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6132514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6133121Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6133766Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6134417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6134981Z ) 2025-05-07T20:32:42.6135663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6136373Z def test_silu_mul_quant( 2025-05-07T20:32:42.6136654Z self, 2025-05-07T20:32:42.6136851Z T: int, 2025-05-07T20:32:42.6137098Z D: int, 2025-05-07T20:32:42.6137320Z scale_ub: Optional[float], 2025-05-07T20:32:42.6137604Z contiguous: bool, 2025-05-07T20:32:42.6137854Z compiled: bool, 2025-05-07T20:32:42.6138084Z ) -> None: 2025-05-07T20:32:42.6138312Z torch.manual_seed(2025) 2025-05-07T20:32:42.6138565Z 2025-05-07T20:32:42.6138842Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6139190Z 2025-05-07T20:32:42.6139397Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6139687Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6140002Z x = x_sign * x_clamp 2025-05-07T20:32:42.6140251Z x0 = x[:, :D] 2025-05-07T20:32:42.6140475Z x1 = x[:, D:] 2025-05-07T20:32:42.6140686Z 2025-05-07T20:32:42.6140886Z if contiguous: 2025-05-07T20:32:42.6141129Z x0 = x0.contiguous() 2025-05-07T20:32:42.6141389Z x1 = x1.contiguous() 2025-05-07T20:32:42.6141638Z 2025-05-07T20:32:42.6149525Z if scale_ub is not None: 2025-05-07T20:32:42.6149841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6150185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6150505Z ) 2025-05-07T20:32:42.6150714Z else: 2025-05-07T20:32:42.6150929Z scale_ub_tensor = None 2025-05-07T20:32:42.6151194Z 2025-05-07T20:32:42.6151728Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6152061Z op = silu_mul_quant 2025-05-07T20:32:42.6152315Z if compiled: 2025-05-07T20:32:42.6152578Z op = torch.compile(op) 2025-05-07T20:32:42.6152886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6153163Z 2025-05-07T20:32:42.6153370Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6153537Z 2025-05-07T20:32:42.6153655Z moe/activation_test.py:117: 2025-05-07T20:32:42.6153952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6154299Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6154682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6155342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6155961Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6156713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6157406Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6157940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6158625Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6159289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6159827Z kernel = self.compile( 2025-05-07T20:32:42.6160368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6161037Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6161446Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6161680Z 2025-05-07T20:32:42.6161890Z self = 2025-05-07T20:32:42.6162980Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6164374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655c5dbc0>} 2025-05-07T20:32:42.6165727Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6166764Z context = 2025-05-07T20:32:42.6167053Z 2025-05-07T20:32:42.6167225Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6167755Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6168232Z module_map=module_map) 2025-05-07T20:32:42.6168606Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6168958Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6169228Z E ^ 2025-05-07T20:32:42.6169701Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
def test_silu_mul_quant(
    self,
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    torch.manual_seed(2025)

    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

    x_sign = torch.sign(x)
    x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
    x = x_sign * x_clamp
    x0 = x[:, :D]
    x1 = x[:, D:]

    if contiguous:
        x0 = x0.contiguous()
        x1 = x1.contiguous()

    if scale_ub is not None:
        scale_ub_tensor = torch.tensor(
            [scale_ub], device="cuda", dtype=torch.float32
        )
    else:
        scale_ub_tensor = None

    def fn() -> Tuple[torch.Tensor, torch.Tensor]:
        op = silu_mul_quant
        if compiled:
            op = torch.compile(op)
        return op(x0, x1, scale_ub_tensor)

>   y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f9655c5e700>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
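Every one of these examples fails at kernel-compile time, before the input data matters: Triton aborts while lowering the fp8e4nv element type in _fbgemm_silu_mul_quant. fp8e4nv is Triton's name for the float8_e4m3fn format; per the error text it is unavailable on this GPU architecture (the A10G on a g5 instance is sm_86, and Triton emits fp8e4nv only for compute capability 8.9 and newer, i.e. Ada and Hopper parts). A minimal capability guard, sketched under that assumption; this is hypothetical, not FBGEMM's actual dispatch logic:

    import torch

    # Hypothetical guard, not FBGEMM's dispatch code: Triton's fp8e4nv
    # (torch.float8_e4m3fn) is assumed here to require an NVIDIA GPU with
    # compute capability >= (8, 9). The A10G driving this job reports (8, 6),
    # which matches the CompilationError above.
    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    if not supports_fp8e4nv():
        print("fp8e4nv unsupported on this GPU; skip or fall back to bf16")

A guard like this could route the op to a bf16 fallback, or let the test suite skip its fp8 variants on pre-sm_89 runners instead of failing every Hypothesis example.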
Hypothesis went on to try further parameter combinations. Each produced the same source listing and traceback as above and ended in the identical CompilationError at triton/compiler/compiler.py:100 (for compiled=False examples the torch/_dynamo/eval_frame.py frame is simply absent from the traceback):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
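For orientation while scanning these repeats: silu_mul_quant fuses SwiGLU-style gating, silu(x0) * x1, with quantization to fp8. A rough eager-mode sketch of those semantics follows; it assumes row-wise scaling and a float8_e4m3fn output, which may differ from what the FBGEMM Triton kernel actually does, and is only a reference for reading the test:

    from typing import Optional, Tuple

    import torch
    import torch.nn.functional as F

    # Eager-mode reference sketch (assumed semantics, not the FBGEMM kernel).
    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SwiGLU-style gating, computed in fp32 for accuracy.
        y = F.silu(x0.float()) * x1.float()
        # Per-row absolute max; scale_ub, if given, caps it (mirroring the
        # scale_ub_tensor argument the test passes).
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Dequantizing with y_fp8.float() * y_scale recovers y up to fp8 rounding, which is how a test like this would typically compare against a bf16 reference.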
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2681496Z 2025-05-07T20:32:43.2681913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2682425Z 2025-05-07T20:32:43.2682532Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2682946Z self=, 2025-05-07T20:32:43.2683348Z T=128, 2025-05-07T20:32:43.2683533Z D=5120, 2025-05-07T20:32:43.2683729Z scale_ub=1200.0, 2025-05-07T20:32:43.2683956Z contiguous=False, 2025-05-07T20:32:43.2684177Z compiled=True, 2025-05-07T20:32:43.2684387Z ) 2025-05-07T20:32:43.5168281Z self = 2025-05-07T20:32:43.5169019Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.5169397Z 2025-05-07T20:32:43.5169527Z @given( 2025-05-07T20:32:43.5169909Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5170247Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5170565Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5170904Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5171237Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5171531Z ) 2025-05-07T20:32:43.5171894Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5172334Z def test_silu_mul_quant( 2025-05-07T20:32:43.5172586Z self, 2025-05-07T20:32:43.5172795Z T: int, 2025-05-07T20:32:43.5173007Z D: int, 2025-05-07T20:32:43.5173241Z scale_ub: Optional[float], 2025-05-07T20:32:43.5173515Z contiguous: bool, 2025-05-07T20:32:43.5173757Z compiled: bool, 2025-05-07T20:32:43.5173993Z ) -> None: 2025-05-07T20:32:43.5174216Z torch.manual_seed(2025) 2025-05-07T20:32:43.5174461Z 2025-05-07T20:32:43.5174732Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5175075Z 2025-05-07T20:32:43.5175285Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5175577Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5175893Z x = x_sign * x_clamp 2025-05-07T20:32:43.5176164Z x0 = x[:, :D] 2025-05-07T20:32:43.5176412Z x1 = x[:, D:] 2025-05-07T20:32:43.5176628Z 2025-05-07T20:32:43.5176818Z if contiguous: 2025-05-07T20:32:43.5177047Z x0 = x0.contiguous() 2025-05-07T20:32:43.5177309Z x1 = x1.contiguous() 2025-05-07T20:32:43.5177554Z 2025-05-07T20:32:43.5177748Z if scale_ub is not None: 2025-05-07T20:32:43.5181116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5181459Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5181765Z ) 2025-05-07T20:32:43.5181965Z else: 2025-05-07T20:32:43.5182272Z scale_ub_tensor = None 2025-05-07T20:32:43.5182520Z 2025-05-07T20:32:43.5182757Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5183072Z op = silu_mul_quant 2025-05-07T20:32:43.5183318Z if compiled: 2025-05-07T20:32:43.5183568Z op = torch.compile(op) 2025-05-07T20:32:43.5183872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5184140Z 2025-05-07T20:32:43.5184342Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5184505Z 2025-05-07T20:32:43.5184616Z moe/activation_test.py:117: 2025-05-07T20:32:43.5184917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5185343Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5185656Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5186221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.5187021Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.5187840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5188576Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5189210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5189887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5190549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5191073Z kernel = self.compile( 2025-05-07T20:32:43.5191618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5192283Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5192684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5192924Z 2025-05-07T20:32:43.5193136Z self = 2025-05-07T20:32:43.5194212Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5195577Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655816340>} 2025-05-07T20:32:43.5196922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5197992Z context = 2025-05-07T20:32:43.5198285Z 2025-05-07T20:32:43.5198452Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5198971Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5199437Z module_map=module_map) 2025-05-07T20:32:43.5199799Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5200153Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5200415Z E ^ 2025-05-07T20:32:43.5200872Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5201323Z 2025-05-07T20:32:43.5201737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5202355Z 2025-05-07T20:32:43.5202464Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5202880Z self=, 2025-05-07T20:32:43.5203353Z T=16384, 2025-05-07T20:32:43.5203560Z D=7168, 2025-05-07T20:32:43.5203761Z scale_ub=1200.0, 2025-05-07T20:32:43.5203985Z contiguous=True, 2025-05-07T20:32:43.5204216Z compiled=True, 2025-05-07T20:32:43.5204430Z ) 2025-05-07T20:32:43.5204747Z self = 2025-05-07T20:32:43.5205240Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.5205525Z 2025-05-07T20:32:43.5205606Z @given( 2025-05-07T20:32:43.5205845Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5206183Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5206519Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5206897Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5207219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5207504Z ) 2025-05-07T20:32:43.5207895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5208331Z def test_silu_mul_quant( 2025-05-07T20:32:43.5208573Z self, 2025-05-07T20:32:43.5208774Z T: int, 2025-05-07T20:32:43.5208969Z D: int, 2025-05-07T20:32:43.5209195Z scale_ub: Optional[float], 2025-05-07T20:32:43.5209475Z contiguous: bool, 2025-05-07T20:32:43.5209717Z compiled: bool, 2025-05-07T20:32:43.5209941Z ) -> None: 2025-05-07T20:32:43.5210166Z torch.manual_seed(2025) 2025-05-07T20:32:43.5210411Z 2025-05-07T20:32:43.5217393Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5217868Z 2025-05-07T20:32:43.5218131Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5218441Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5218760Z x = x_sign * x_clamp 2025-05-07T20:32:43.5219014Z x0 = x[:, :D] 2025-05-07T20:32:43.5219251Z x1 = x[:, D:] 2025-05-07T20:32:43.5219466Z 2025-05-07T20:32:43.5219674Z if contiguous: 2025-05-07T20:32:43.5219922Z x0 = x0.contiguous() 2025-05-07T20:32:43.5220184Z x1 = x1.contiguous() 2025-05-07T20:32:43.5220434Z 2025-05-07T20:32:43.5220639Z if scale_ub is not None: 2025-05-07T20:32:43.5220917Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5221251Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5221560Z ) 2025-05-07T20:32:43.5221771Z else: 2025-05-07T20:32:43.5221987Z scale_ub_tensor = None 2025-05-07T20:32:43.5222254Z 2025-05-07T20:32:43.5222499Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5222819Z op = silu_mul_quant 2025-05-07T20:32:43.5223096Z if compiled: 2025-05-07T20:32:43.5223356Z op = torch.compile(op) 2025-05-07T20:32:43.5223661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5223952Z 2025-05-07T20:32:43.5224160Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5224333Z 2025-05-07T20:32:43.5224439Z moe/activation_test.py:117: 2025-05-07T20:32:43.5224746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5225090Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5225376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5225962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.5226536Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.5227258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5227956Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5228891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5229707Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5230373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5230900Z kernel = self.compile( 2025-05-07T20:32:43.5231451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5232113Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5232513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5232751Z 2025-05-07T20:32:43.5232960Z self = 2025-05-07T20:32:43.5234046Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5235555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655817c40>} 2025-05-07T20:32:43.5236902Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5237919Z context = 2025-05-07T20:32:43.5238219Z 2025-05-07T20:32:43.5238388Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5238912Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5239397Z module_map=module_map) 2025-05-07T20:32:43.5239768Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5240137Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5240413Z E ^ 2025-05-07T20:32:43.5240886Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5241343Z 2025-05-07T20:32:43.5241765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5242286Z 2025-05-07T20:32:43.6203635Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6204337Z self=, 2025-05-07T20:32:43.6204913Z T=16384, 2025-05-07T20:32:43.6205213Z D=5120, 2025-05-07T20:32:43.6205424Z scale_ub=1200.0, 2025-05-07T20:32:43.6205668Z contiguous=True, 2025-05-07T20:32:43.6205945Z compiled=False, 2025-05-07T20:32:43.6206180Z ) 2025-05-07T20:32:43.6206546Z self = 2025-05-07T20:32:43.6207140Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.6207479Z 2025-05-07T20:32:43.6207566Z @given( 2025-05-07T20:32:43.6207827Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6208187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6208541Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6208875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6209218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6209509Z ) 2025-05-07T20:32:43.6209861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6210306Z def test_silu_mul_quant( 2025-05-07T20:32:43.6210558Z self, 2025-05-07T20:32:43.6210758Z T: int, 2025-05-07T20:32:43.6210972Z D: int, 2025-05-07T20:32:43.6211484Z scale_ub: Optional[float], 2025-05-07T20:32:43.6211758Z contiguous: bool, 2025-05-07T20:32:43.6212010Z compiled: bool, 2025-05-07T20:32:43.6212251Z ) -> None: 2025-05-07T20:32:43.6212604Z torch.manual_seed(2025) 2025-05-07T20:32:43.6212861Z 2025-05-07T20:32:43.6213143Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6213493Z 2025-05-07T20:32:43.6213693Z x_sign = torch.sign(x) 2025-05-07T20:32:43.6213998Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.6214317Z x = x_sign * x_clamp 2025-05-07T20:32:43.6214557Z x0 = x[:, :D] 2025-05-07T20:32:43.6214786Z x1 = x[:, D:] 2025-05-07T20:32:43.6215003Z 2025-05-07T20:32:43.6215194Z if contiguous: 2025-05-07T20:32:43.6215437Z x0 = x0.contiguous() 2025-05-07T20:32:43.6215707Z x1 = x1.contiguous() 2025-05-07T20:32:43.6216037Z 2025-05-07T20:32:43.6216255Z if scale_ub is not None: 2025-05-07T20:32:43.6216541Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.6216958Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.6217281Z ) 2025-05-07T20:32:43.6217492Z else: 2025-05-07T20:32:43.6217712Z scale_ub_tensor = None 2025-05-07T20:32:43.6217970Z 2025-05-07T20:32:43.6218219Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.6218538Z op = silu_mul_quant 2025-05-07T20:32:43.6218804Z if compiled: 2025-05-07T20:32:43.6219070Z op = torch.compile(op) 2025-05-07T20:32:43.6219373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6219646Z 2025-05-07T20:32:43.6219853Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.6220020Z 2025-05-07T20:32:43.6220135Z moe/activation_test.py:117: 2025-05-07T20:32:43.6220433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6220771Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.6221057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6221756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.6222449Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.6222985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.6223675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.6224343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.6224871Z kernel = self.compile( 2025-05-07T20:32:43.6225417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.6226079Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.6226477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6226714Z 2025-05-07T20:32:43.6226924Z self = 2025-05-07T20:32:43.6228000Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.6229707Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655618ae0>} 2025-05-07T20:32:43.6231040Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.6232144Z context = 2025-05-07T20:32:43.6232435Z 2025-05-07T20:32:43.6232603Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.6233187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.6233659Z module_map=module_map) 2025-05-07T20:32:43.6234021Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.6234379Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.6234644Z E ^ 2025-05-07T20:32:43.6235108Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.6235561Z 2025-05-07T20:32:43.6235976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.6236496Z 2025-05-07T20:32:43.6236670Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6237091Z self=, 2025-05-07T20:32:43.6237490Z T=1, 2025-05-07T20:32:43.6237686Z D=7168, 2025-05-07T20:32:43.6237949Z scale_ub=1200.0, 2025-05-07T20:32:43.6238183Z contiguous=False, 2025-05-07T20:32:43.6238416Z compiled=False, 2025-05-07T20:32:43.6238629Z ) 2025-05-07T20:32:43.6238946Z self = 2025-05-07T20:32:43.6239436Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.6239700Z 2025-05-07T20:32:43.6239791Z @given( 2025-05-07T20:32:43.6240027Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6240346Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6240658Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6240992Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6241322Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6241614Z ) 2025-05-07T20:32:43.6241969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6242417Z def test_silu_mul_quant( 2025-05-07T20:32:43.6242676Z self, 2025-05-07T20:32:43.6243446Z T: int, 2025-05-07T20:32:43.6243732Z D: int, 2025-05-07T20:32:43.6243980Z scale_ub: Optional[float], 2025-05-07T20:32:43.6244286Z contiguous: bool, 2025-05-07T20:32:43.6244594Z compiled: bool, 2025-05-07T20:32:43.6244826Z ) -> None: 2025-05-07T20:32:43.6245053Z torch.manual_seed(2025) 2025-05-07T20:32:43.6245295Z 2025-05-07T20:32:43.6245579Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6245924Z 2025-05-07T20:32:43.6246130Z x_sign = torch.sign(x) 2025-05-07T20:32:43.6246418Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.6246730Z x = x_sign * x_clamp 2025-05-07T20:32:43.6246980Z x0 = x[:, :D] 2025-05-07T20:32:43.6247195Z x1 = x[:, D:] 2025-05-07T20:32:43.6247403Z 2025-05-07T20:32:43.6247603Z if contiguous: 2025-05-07T20:32:43.6247842Z x0 = x0.contiguous() 2025-05-07T20:32:43.6248101Z x1 = x1.contiguous() 2025-05-07T20:32:43.6248346Z 2025-05-07T20:32:43.6248545Z if scale_ub is not None: 2025-05-07T20:32:43.6248817Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.6249158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.6249471Z ) 2025-05-07T20:32:43.6249663Z else: 2025-05-07T20:32:43.6249880Z scale_ub_tensor = None 2025-05-07T20:32:43.6250135Z 2025-05-07T20:32:43.6250405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.6250739Z op = silu_mul_quant 2025-05-07T20:32:43.6251001Z if compiled: 2025-05-07T20:32:43.6251247Z op = torch.compile(op) 2025-05-07T20:32:43.6251634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6251912Z 2025-05-07T20:32:43.6252109Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.6252272Z 2025-05-07T20:32:43.6252376Z moe/activation_test.py:117: 2025-05-07T20:32:43.6252718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6253054Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.6253330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6254016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.6254701Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.6255239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.6255911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.6256681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.6257219Z kernel = self.compile( 2025-05-07T20:32:43.6257797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.6258455Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.6258855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6259080Z 2025-05-07T20:32:43.6259292Z self = 2025-05-07T20:32:43.6260361Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.6261724Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655618400>} 2025-05-07T20:32:43.6263064Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.6264081Z context = 2025-05-07T20:32:43.6264365Z 2025-05-07T20:32:43.6264536Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.6265045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.6265518Z module_map=module_map) 2025-05-07T20:32:43.6265886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.6266233Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.6266499Z E ^ 2025-05-07T20:32:43.6266973Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.6267425Z 2025-05-07T20:32:43.6267854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.6268362Z 2025-05-07T20:32:43.7609391Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7610577Z self=, 2025-05-07T20:32:43.7611391Z T=4096, 2025-05-07T20:32:43.7611772Z D=7168, 2025-05-07T20:32:43.7612156Z scale_ub=1200.0, 2025-05-07T20:32:43.7612614Z contiguous=False, 2025-05-07T20:32:43.7613068Z compiled=True, 2025-05-07T20:32:43.7613471Z ) 2025-05-07T20:32:43.7614105Z self = 2025-05-07T20:32:43.7615082Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.7615624Z 2025-05-07T20:32:43.7615816Z @given( 2025-05-07T20:32:43.7616447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7616767Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7617091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7617514Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7617845Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7618133Z ) 2025-05-07T20:32:43.7618478Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7618922Z def test_silu_mul_quant( 2025-05-07T20:32:43.7619172Z self, 2025-05-07T20:32:43.7619374Z T: int, 2025-05-07T20:32:43.7619570Z D: int, 2025-05-07T20:32:43.7619795Z scale_ub: Optional[float], 2025-05-07T20:32:43.7620068Z contiguous: bool, 2025-05-07T20:32:43.7620302Z compiled: bool, 2025-05-07T20:32:43.7620530Z ) -> None: 2025-05-07T20:32:43.7620750Z torch.manual_seed(2025) 2025-05-07T20:32:43.7621082Z 2025-05-07T20:32:43.7621360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7621704Z 2025-05-07T20:32:43.7621897Z x_sign = torch.sign(x) 2025-05-07T20:32:43.7622283Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.7622597Z x = x_sign * x_clamp 2025-05-07T20:32:43.7622833Z x0 = x[:, :D] 2025-05-07T20:32:43.7623055Z x1 = x[:, D:] 2025-05-07T20:32:43.7623266Z 2025-05-07T20:32:43.7623450Z if contiguous: 2025-05-07T20:32:43.7623683Z x0 = x0.contiguous() 2025-05-07T20:32:43.7623945Z x1 = x1.contiguous() 2025-05-07T20:32:43.7624184Z 2025-05-07T20:32:43.7624386Z if scale_ub is not None: 2025-05-07T20:32:43.7624667Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.7625002Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.7625305Z ) 2025-05-07T20:32:43.7625503Z else: 2025-05-07T20:32:43.7625723Z scale_ub_tensor = None 2025-05-07T20:32:43.7625972Z 2025-05-07T20:32:43.7626206Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.7626523Z op = silu_mul_quant 2025-05-07T20:32:43.7626775Z if compiled: 2025-05-07T20:32:43.7627030Z op = torch.compile(op) 2025-05-07T20:32:43.7627330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7627600Z 2025-05-07T20:32:43.7627808Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.7627974Z 2025-05-07T20:32:43.7628082Z moe/activation_test.py:117: 2025-05-07T20:32:43.7628649Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7628989Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.7629348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7629911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.7630470Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.7631129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.7631819Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.7632354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.7633035Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.7633698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.7634231Z kernel = self.compile( 2025-05-07T20:32:43.7634767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.7635419Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.7635819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7636177Z 2025-05-07T20:32:43.7636391Z self = 2025-05-07T20:32:43.7637523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.7638896Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965561af20>} 2025-05-07T20:32:43.7640225Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.7641242Z context = 2025-05-07T20:32:43.7641589Z 2025-05-07T20:32:43.7641766Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.7642281Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.7642816Z module_map=module_map) 2025-05-07T20:32:43.7643184Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.7643530Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.7643794Z E ^ 2025-05-07T20:32:43.7644258Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.7644707Z 2025-05-07T20:32:43.7645128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.7645636Z 2025-05-07T20:32:43.7645744Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7646158Z self=, 2025-05-07T20:32:43.7646567Z T=128, 2025-05-07T20:32:43.7646755Z D=7168, 2025-05-07T20:32:43.7646951Z scale_ub=1200.0, 2025-05-07T20:32:43.7647183Z contiguous=False, 2025-05-07T20:32:43.7647411Z compiled=True, 2025-05-07T20:32:43.7647617Z ) 2025-05-07T20:32:43.8366008Z self = 2025-05-07T20:32:43.8366774Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.8367055Z 2025-05-07T20:32:43.8367151Z @given( 2025-05-07T20:32:43.8367386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8367714Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8368034Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8368364Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8368697Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8368989Z ) 2025-05-07T20:32:43.8369339Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8369803Z def test_silu_mul_quant( 2025-05-07T20:32:43.8370055Z self, 2025-05-07T20:32:43.8370256Z T: int, 2025-05-07T20:32:43.8370473Z D: int, 2025-05-07T20:32:43.8370707Z scale_ub: Optional[float], 2025-05-07T20:32:43.8370987Z contiguous: bool, 2025-05-07T20:32:43.8371229Z compiled: bool, 2025-05-07T20:32:43.8371467Z ) -> None: 2025-05-07T20:32:43.8371692Z torch.manual_seed(2025) 2025-05-07T20:32:43.8371936Z 2025-05-07T20:32:43.8372216Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8372563Z 2025-05-07T20:32:43.8372762Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8373059Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8373372Z x = x_sign * x_clamp 2025-05-07T20:32:43.8373616Z x0 = x[:, :D] 2025-05-07T20:32:43.8373850Z x1 = x[:, D:] 2025-05-07T20:32:43.8374071Z 2025-05-07T20:32:43.8374488Z if contiguous: 2025-05-07T20:32:43.8374734Z x0 = x0.contiguous() 2025-05-07T20:32:43.8374998Z x1 = x1.contiguous() 2025-05-07T20:32:43.8375236Z 2025-05-07T20:32:43.8375526Z if scale_ub is not None: 2025-05-07T20:32:43.8375808Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8376140Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8376503Z ) 2025-05-07T20:32:43.8376707Z else: 2025-05-07T20:32:43.8376923Z scale_ub_tensor = None 2025-05-07T20:32:43.8377173Z 2025-05-07T20:32:43.8377414Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8377730Z op = silu_mul_quant 2025-05-07T20:32:43.8377983Z if compiled: 2025-05-07T20:32:43.8378254Z op = torch.compile(op) 2025-05-07T20:32:43.8385462Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8385887Z 2025-05-07T20:32:43.8386093Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8386273Z 2025-05-07T20:32:43.8386380Z moe/activation_test.py:117: 2025-05-07T20:32:43.8386813Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8387187Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8387474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8388046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.8388622Z return fn(*args, **kwargs) 
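For readers following the traceback, the op under test fuses a SwiGLU-style activation with FP8 quantization: the names silu_mul_quant, y_fp8, y_scale, and scale_ub in the test imply silu(x0) * x1 followed by a scaled cast to float8. A hedged eager-mode sketch of those semantics, assuming rowwise e4m3 quantization with an optional scale upper bound (the fused kernel's exact scaling convention is not shown in this log):

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Compute silu(x0) * x1 in float32, then quantize each row to fp8.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            # Clamp the per-row amax to the caller-provided upper bound.
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale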
2025-05-07T20:32:43.8389359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8390061Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8390605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8391294Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8391963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8392518Z kernel = self.compile( 2025-05-07T20:32:43.8393074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8393737Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8394146Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8394383Z 2025-05-07T20:32:43.8394593Z self = 2025-05-07T20:32:43.8395691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8397103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655d14220>} 2025-05-07T20:32:43.8398460Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8399503Z context = 2025-05-07T20:32:43.8399798Z 2025-05-07T20:32:43.8399968Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8400497Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8400969Z module_map=module_map) 2025-05-07T20:32:43.8401344Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8401714Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8401981Z E ^ 2025-05-07T20:32:43.8402528Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8402993Z 2025-05-07T20:32:43.8403463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8403980Z 2025-05-07T20:32:43.8404096Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8404518Z self=, 2025-05-07T20:32:43.8404937Z T=2048, 2025-05-07T20:32:43.8405147Z D=7168, 2025-05-07T20:32:43.8405355Z scale_ub=None, 2025-05-07T20:32:43.8405585Z contiguous=True, 2025-05-07T20:32:43.8405828Z compiled=True, 2025-05-07T20:32:43.8406054Z ) 2025-05-07T20:32:43.8406423Z self = 2025-05-07T20:32:43.8406951Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.8407268Z 2025-05-07T20:32:43.8407365Z @given( 2025-05-07T20:32:43.8407610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8407937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8408296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8408627Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8408969Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8409266Z ) 2025-05-07T20:32:43.8409629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8410097Z def test_silu_mul_quant( 2025-05-07T20:32:43.8410354Z self, 2025-05-07T20:32:43.8410556Z T: int, 2025-05-07T20:32:43.8410774Z D: int, 2025-05-07T20:32:43.8410997Z scale_ub: Optional[float], 2025-05-07T20:32:43.8411274Z contiguous: bool, 2025-05-07T20:32:43.8411516Z compiled: bool, 2025-05-07T20:32:43.8411737Z ) -> None: 2025-05-07T20:32:43.8411964Z torch.manual_seed(2025) 2025-05-07T20:32:43.8412211Z 2025-05-07T20:32:43.8412502Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8412856Z 2025-05-07T20:32:43.8413064Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8413368Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8413694Z x = x_sign * x_clamp 2025-05-07T20:32:43.8413939Z x0 = x[:, :D] 2025-05-07T20:32:43.8414169Z x1 = x[:, D:] 2025-05-07T20:32:43.8414393Z 2025-05-07T20:32:43.8414587Z if contiguous: 2025-05-07T20:32:43.8414832Z x0 = x0.contiguous() 2025-05-07T20:32:43.8415107Z x1 = x1.contiguous() 2025-05-07T20:32:43.8415348Z 2025-05-07T20:32:43.8415551Z if scale_ub is not None: 2025-05-07T20:32:43.8415842Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8416183Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8416522Z ) 2025-05-07T20:32:43.8416772Z else: 2025-05-07T20:32:43.8416999Z scale_ub_tensor = None 2025-05-07T20:32:43.8417256Z 2025-05-07T20:32:43.8417511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8417843Z op = silu_mul_quant 2025-05-07T20:32:43.8418109Z if compiled: 2025-05-07T20:32:43.8418376Z op = torch.compile(op) 2025-05-07T20:32:43.8418685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8418963Z 2025-05-07T20:32:43.8419171Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8419342Z 2025-05-07T20:32:43.8419454Z moe/activation_test.py:117: 2025-05-07T20:32:43.8419756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8420113Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8420412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8420995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.8421624Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.8422311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8423056Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8423603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8424307Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8424984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8425534Z kernel = self.compile( 2025-05-07T20:32:43.8426087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8426770Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8427234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8427472Z 2025-05-07T20:32:43.8427767Z self = 2025-05-07T20:32:43.8429193Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8430587Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655d14d60>} 2025-05-07T20:32:43.8431957Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8433002Z context = 2025-05-07T20:32:43.8433301Z 2025-05-07T20:32:43.8433473Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8434015Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8434498Z module_map=module_map) 2025-05-07T20:32:43.8434872Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8435239Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8435515Z E ^ 2025-05-07T20:32:43.8435994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8436500Z 2025-05-07T20:32:43.8436923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8437452Z 2025-05-07T20:32:43.9089239Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9089822Z self=, 2025-05-07T20:32:43.9090356Z T=16384, 2025-05-07T20:32:43.9090562Z D=5120, 2025-05-07T20:32:43.9090755Z scale_ub=None, 2025-05-07T20:32:43.9090986Z contiguous=False, 2025-05-07T20:32:43.9091226Z compiled=False, 2025-05-07T20:32:43.9091452Z ) 2025-05-07T20:32:43.9091775Z self = 2025-05-07T20:32:43.9092274Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.9092550Z 2025-05-07T20:32:43.9092635Z @given( 2025-05-07T20:32:43.9092863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9093193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9093502Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9093830Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9094164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9094453Z ) 2025-05-07T20:32:43.9094991Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9095440Z def test_silu_mul_quant( 2025-05-07T20:32:43.9095691Z self, 2025-05-07T20:32:43.9095963Z T: int, 2025-05-07T20:32:43.9096160Z D: int, 2025-05-07T20:32:43.9096422Z scale_ub: Optional[float], 2025-05-07T20:32:43.9096708Z contiguous: bool, 2025-05-07T20:32:43.9096957Z compiled: bool, 2025-05-07T20:32:43.9097184Z ) -> None: 2025-05-07T20:32:43.9097403Z torch.manual_seed(2025) 2025-05-07T20:32:43.9097640Z 2025-05-07T20:32:43.9097920Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9098266Z 2025-05-07T20:32:43.9098459Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9098751Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9100872Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9102828Z 2025-05-07T20:32:43.9102952Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9103164Z 2025-05-07T20:32:43.9103277Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9103682Z self=, 2025-05-07T20:32:43.9104101Z T=4096, 2025-05-07T20:32:43.9104367Z D=7168, 2025-05-07T20:32:43.9104607Z scale_ub=1200.0, 2025-05-07T20:32:43.9104840Z contiguous=True, 2025-05-07T20:32:43.9105070Z compiled=True, 2025-05-07T20:32:43.9105277Z ) 2025-05-07T20:32:43.9105601Z self = 2025-05-07T20:32:43.9106110Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.9106399Z 2025-05-07T20:32:43.9106498Z @given( 2025-05-07T20:32:43.9106753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9107076Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9107389Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9107718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9108055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9108349Z ) 2025-05-07T20:32:43.9108698Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9109252Z def test_silu_mul_quant( 2025-05-07T20:32:43.9109499Z self, 2025-05-07T20:32:43.9109696Z T: int, 2025-05-07T20:32:43.9109901Z D: int, 2025-05-07T20:32:43.9110131Z scale_ub: Optional[float], 2025-05-07T20:32:43.9110409Z contiguous: bool, 2025-05-07T20:32:43.9110643Z compiled: bool, 2025-05-07T20:32:43.9110870Z ) -> None: 2025-05-07T20:32:43.9111094Z torch.manual_seed(2025) 2025-05-07T20:32:43.9111331Z 2025-05-07T20:32:43.9111619Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9111962Z 2025-05-07T20:32:43.9112165Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9112459Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9114469Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9116485Z 2025-05-07T20:32:43.9116645Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9116862Z 2025-05-07T20:32:43.9116974Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9117388Z self=, 2025-05-07T20:32:43.9117786Z T=16384, 2025-05-07T20:32:43.9117995Z D=7168, 2025-05-07T20:32:43.9118190Z scale_ub=None, 2025-05-07T20:32:43.9118407Z contiguous=False, 2025-05-07T20:32:43.9118640Z compiled=False, 2025-05-07T20:32:43.9118850Z ) 2025-05-07T20:32:43.9119163Z self = 2025-05-07T20:32:43.9119663Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.9119987Z 2025-05-07T20:32:43.9120075Z @given( 2025-05-07T20:32:43.9120303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9120624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9120974Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9121317Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9121643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9121930Z ) 2025-05-07T20:32:43.9122281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9122716Z def test_silu_mul_quant( 2025-05-07T20:32:43.9122963Z self, 2025-05-07T20:32:43.9123166Z T: int, 2025-05-07T20:32:43.9123360Z D: int, 2025-05-07T20:32:43.9123582Z scale_ub: Optional[float], 2025-05-07T20:32:43.9123855Z contiguous: bool, 2025-05-07T20:32:43.9124093Z compiled: bool, 2025-05-07T20:32:43.9124316Z ) -> None: 2025-05-07T20:32:43.9124536Z torch.manual_seed(2025) 2025-05-07T20:32:43.9124783Z 2025-05-07T20:32:43.9125057Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9127168Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9129425Z 2025-05-07T20:32:43.9129554Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.9129769Z 2025-05-07T20:32:43.9129883Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9130289Z self=, 2025-05-07T20:32:43.9130710Z T=2048, 2025-05-07T20:32:43.9130903Z D=7168, 2025-05-07T20:32:43.9131093Z scale_ub=1200.0, 2025-05-07T20:32:43.9131332Z contiguous=True, 2025-05-07T20:32:43.9131558Z compiled=True, 2025-05-07T20:32:43.9131764Z ) 2025-05-07T20:32:43.9132094Z self = 2025-05-07T20:32:43.9132598Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.9132869Z 2025-05-07T20:32:43.9132959Z @given( 2025-05-07T20:32:43.9133185Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9133501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9133815Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9134142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9134482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9134774Z ) 2025-05-07T20:32:43.9135124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9135673Z def test_silu_mul_quant( 2025-05-07T20:32:43.9135923Z self, 2025-05-07T20:32:43.9136121Z T: int, 2025-05-07T20:32:43.9136390Z D: int, 2025-05-07T20:32:43.9136613Z scale_ub: Optional[float], 2025-05-07T20:32:43.9136889Z contiguous: bool, 2025-05-07T20:32:43.9137131Z compiled: bool, 2025-05-07T20:32:43.9137388Z ) -> None: 2025-05-07T20:32:43.9137632Z torch.manual_seed(2025) 2025-05-07T20:32:43.9137874Z 2025-05-07T20:32:43.9138155Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9138514Z 2025-05-07T20:32:43.9138708Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9139019Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9141094Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
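The "Tried to allocate" figures in these OutOfMemoryError messages are exactly the size of the bf16 input x of shape [T, 2 * D] at two bytes per element; a quick check reproduces every number in the log:

    # Each failing allocation is one [T, 2*D] bf16 tensor (2 bytes/element).
    for T, D in [(16384, 7168), (16384, 5120), (4096, 7168), (2048, 7168)]:
        mib = T * 2 * D * 2 / 2**20
        print(f"T={T:>5} D={D}: {mib:.2f} MiB")
    # T=16384 D=7168: 448.00 MiB
    # T=16384 D=5120: 320.00 MiB
    # T= 4096 D=7168: 112.00 MiB
    # T= 2048 D=7168: 56.00 MiB

Note also that x_sign, x_clamp, and x_sign * x_clamp each materialize another tensor of the same size, which is why, when the initial torch.randn succeeds (activation_test.py:92), the failure often moves to lines 94 or 95 instead.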
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9143015Z 2025-05-07T20:32:43.9143135Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9143348Z 2025-05-07T20:32:43.9143464Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9143871Z self=, 2025-05-07T20:32:43.9144279Z T=2048, 2025-05-07T20:32:43.9144469Z D=7168, 2025-05-07T20:32:43.9144657Z scale_ub=None, 2025-05-07T20:32:43.9144874Z contiguous=True, 2025-05-07T20:32:43.9145103Z compiled=False, 2025-05-07T20:32:43.9145310Z ) 2025-05-07T20:32:44.1690670Z self = 2025-05-07T20:32:44.1691401Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.1691790Z 2025-05-07T20:32:44.1691939Z @given( 2025-05-07T20:32:44.1692237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.1692639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.1692948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.1693276Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.1693600Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.1693889Z ) 2025-05-07T20:32:44.1694242Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.1694675Z def test_silu_mul_quant( 2025-05-07T20:32:44.1694920Z self, 2025-05-07T20:32:44.1695119Z T: int, 2025-05-07T20:32:44.1695312Z D: int, 2025-05-07T20:32:44.1695549Z scale_ub: Optional[float], 2025-05-07T20:32:44.1695838Z contiguous: bool, 2025-05-07T20:32:44.1696089Z compiled: bool, 2025-05-07T20:32:44.1696314Z ) -> None: 2025-05-07T20:32:44.1696534Z torch.manual_seed(2025) 2025-05-07T20:32:44.1696780Z 2025-05-07T20:32:44.1697051Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.1697393Z 2025-05-07T20:32:44.1697592Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.1699524Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
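Note how the reported free memory shrinks across successive Hypothesis examples (140.44 MiB, then 28.44 MiB, and later 26.44 MiB) while "allocated by PyTorch" stays pinned near 21.6-21.7 GiB: tensors from earlier failed examples remain live, most likely held by the captured tracebacks, so even modest later allocations fail. A defensive cleanup sketch, assuming a pytest fixture is acceptable in this suite (the fixture name is illustrative):

    import gc
    import pytest
    import torch

    @pytest.fixture(autouse=True)
    def _release_cuda_memory():
        # After each example, drop leftover references and return cached
        # blocks to the driver so the next example starts from a clean pool.
        yield
        gc.collect()
        torch.cuda.empty_cache()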
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.1701687Z 2025-05-07T20:32:44.1701808Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.1702025Z 2025-05-07T20:32:44.1702220Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.1702637Z self=, 2025-05-07T20:32:44.1703042Z T=1, 2025-05-07T20:32:44.1703230Z D=7168, 2025-05-07T20:32:44.1703427Z scale_ub=1200.0, 2025-05-07T20:32:44.1703654Z contiguous=True, 2025-05-07T20:32:44.1703873Z compiled=False, 2025-05-07T20:32:44.1704086Z ) 2025-05-07T20:32:44.1704404Z self = 2025-05-07T20:32:44.1704887Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.1705158Z 2025-05-07T20:32:44.1705238Z @given( 2025-05-07T20:32:44.1705469Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.1705907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.1706221Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.1706550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.1706961Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.1707242Z ) 2025-05-07T20:32:44.1707594Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.1708036Z def test_silu_mul_quant( 2025-05-07T20:32:44.1708273Z self, 2025-05-07T20:32:44.1708471Z T: int, 2025-05-07T20:32:44.1708676Z D: int, 2025-05-07T20:32:44.1708890Z scale_ub: Optional[float], 2025-05-07T20:32:44.1709289Z contiguous: bool, 2025-05-07T20:32:44.1709530Z compiled: bool, 2025-05-07T20:32:44.1709746Z ) -> None: 2025-05-07T20:32:44.1709963Z torch.manual_seed(2025) 2025-05-07T20:32:44.1710203Z 2025-05-07T20:32:44.1710471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.1710818Z 2025-05-07T20:32:44.1711017Z x_sign = torch.sign(x) 2025-05-07T20:32:44.1711308Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.1711616Z x = x_sign * x_clamp 2025-05-07T20:32:44.1711866Z x0 = x[:, :D] 2025-05-07T20:32:44.1712084Z x1 = x[:, D:] 2025-05-07T20:32:44.1712288Z 2025-05-07T20:32:44.1712478Z if contiguous: 2025-05-07T20:32:44.1712712Z x0 = x0.contiguous() 2025-05-07T20:32:44.1712968Z x1 = x1.contiguous() 2025-05-07T20:32:44.1713213Z 2025-05-07T20:32:44.1713418Z if scale_ub is not None: 2025-05-07T20:32:44.1713697Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.1714048Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.1714359Z ) 2025-05-07T20:32:44.1714553Z else: 2025-05-07T20:32:44.1714770Z scale_ub_tensor = None 2025-05-07T20:32:44.1715026Z 2025-05-07T20:32:44.1715256Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.1715579Z op = silu_mul_quant 2025-05-07T20:32:44.1715835Z if compiled: 2025-05-07T20:32:44.1716086Z op = torch.compile(op) 2025-05-07T20:32:44.1716394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.1716674Z 2025-05-07T20:32:44.1716876Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.1717037Z 2025-05-07T20:32:44.1717138Z moe/activation_test.py:117: 2025-05-07T20:32:44.1717435Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1717769Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.1718047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.1718738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.1719427Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.1719968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.1720704Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.1721405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.1721941Z kernel = self.compile( 2025-05-07T20:32:44.1722477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.1723135Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.1723542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1723772Z 2025-05-07T20:32:44.1723984Z self = 2025-05-07T20:32:44.1725058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.1726506Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556c540>} 2025-05-07T20:32:44.1727844Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.1729139Z context = 2025-05-07T20:32:44.1729424Z 2025-05-07T20:32:44.1729596Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.1730104Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.1730566Z module_map=module_map) 2025-05-07T20:32:44.1730937Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.1731283Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.1731547Z E ^ 2025-05-07T20:32:44.1732025Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.1732473Z 2025-05-07T20:32:44.1732894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.1733403Z 2025-05-07T20:32:44.1733516Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.1733933Z self=, 2025-05-07T20:32:44.1734337Z T=128, 2025-05-07T20:32:44.1734522Z D=5120, 2025-05-07T20:32:44.1734743Z scale_ub=None, 2025-05-07T20:32:44.1734977Z contiguous=True, 2025-05-07T20:32:44.1735205Z compiled=False, 2025-05-07T20:32:44.1735427Z ) 2025-05-07T20:32:44.2284922Z self = 2025-05-07T20:32:44.2285449Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.2285814Z 2025-05-07T20:32:44.2285950Z @given( 2025-05-07T20:32:44.2286291Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2286718Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2298122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2298519Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2298857Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2299139Z ) 2025-05-07T20:32:44.2299497Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2299951Z def test_silu_mul_quant( 2025-05-07T20:32:44.2300194Z self, 2025-05-07T20:32:44.2300397Z T: int, 2025-05-07T20:32:44.2300600Z D: int, 2025-05-07T20:32:44.2300825Z scale_ub: Optional[float], 2025-05-07T20:32:44.2301392Z contiguous: bool, 2025-05-07T20:32:44.2301642Z compiled: bool, 2025-05-07T20:32:44.2301872Z ) -> None: 2025-05-07T20:32:44.2302097Z torch.manual_seed(2025) 2025-05-07T20:32:44.2302346Z 2025-05-07T20:32:44.2302717Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2303060Z 2025-05-07T20:32:44.2303259Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2303585Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2303905Z x = x_sign * x_clamp 2025-05-07T20:32:44.2304151Z x0 = x[:, :D] 2025-05-07T20:32:44.2304366Z x1 = x[:, D:] 2025-05-07T20:32:44.2304581Z 2025-05-07T20:32:44.2304778Z if contiguous: 2025-05-07T20:32:44.2305014Z x0 = x0.contiguous() 2025-05-07T20:32:44.2305288Z x1 = x1.contiguous() 2025-05-07T20:32:44.2305533Z 2025-05-07T20:32:44.2305724Z if scale_ub is not None: 2025-05-07T20:32:44.2306105Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2306478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2306809Z ) 2025-05-07T20:32:44.2307099Z else: 2025-05-07T20:32:44.2307321Z scale_ub_tensor = None 2025-05-07T20:32:44.2307576Z 2025-05-07T20:32:44.2307818Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2308138Z op = silu_mul_quant 2025-05-07T20:32:44.2308399Z if compiled: 2025-05-07T20:32:44.2308643Z op = torch.compile(op) 2025-05-07T20:32:44.2308942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2309288Z 2025-05-07T20:32:44.2309482Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2309653Z 2025-05-07T20:32:44.2309753Z moe/activation_test.py:117: 2025-05-07T20:32:44.2310051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2310378Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2310669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2311365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2312069Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2312601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2313291Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2313956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2314484Z kernel = self.compile( 2025-05-07T20:32:44.2315033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2315704Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2316112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2316348Z 2025-05-07T20:32:44.2316592Z self = 2025-05-07T20:32:44.2317703Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2319109Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556d620>} 2025-05-07T20:32:44.2320463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2321500Z context = 2025-05-07T20:32:44.2321873Z 2025-05-07T20:32:44.2322042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2322570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2323090Z module_map=module_map) 2025-05-07T20:32:44.2323456Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2323814Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2324079Z E ^ 2025-05-07T20:32:44.2324541Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2324998Z 2025-05-07T20:32:44.2325422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2325941Z 2025-05-07T20:32:44.2326048Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2326467Z self=, 2025-05-07T20:32:44.2326915Z T=128, 2025-05-07T20:32:44.2327112Z D=7168, 2025-05-07T20:32:44.2327315Z scale_ub=None, 2025-05-07T20:32:44.2327574Z contiguous=True, 2025-05-07T20:32:44.2327812Z compiled=False, 2025-05-07T20:32:44.2328026Z ) 2025-05-07T20:32:44.2328635Z self = 2025-05-07T20:32:44.2329129Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.2329408Z 2025-05-07T20:32:44.2329491Z @given( 2025-05-07T20:32:44.2329731Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2330042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2330355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2330695Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2331021Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2331313Z ) 2025-05-07T20:32:44.2331665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2332118Z def test_silu_mul_quant( 2025-05-07T20:32:44.2332361Z self, 2025-05-07T20:32:44.2332567Z T: int, 2025-05-07T20:32:44.2332780Z D: int, 2025-05-07T20:32:44.2332998Z scale_ub: Optional[float], 2025-05-07T20:32:44.2333272Z contiguous: bool, 2025-05-07T20:32:44.2333516Z compiled: bool, 2025-05-07T20:32:44.2333740Z ) -> None: 2025-05-07T20:32:44.2333962Z torch.manual_seed(2025) 2025-05-07T20:32:44.2334210Z 2025-05-07T20:32:44.2334482Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2334836Z 2025-05-07T20:32:44.2335040Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2335333Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2335653Z x = x_sign * x_clamp 2025-05-07T20:32:44.2335904Z x0 = x[:, :D] 2025-05-07T20:32:44.2336128Z x1 = x[:, D:] 2025-05-07T20:32:44.2336349Z 2025-05-07T20:32:44.2336553Z if contiguous: 2025-05-07T20:32:44.2336805Z x0 = x0.contiguous() 2025-05-07T20:32:44.2337114Z x1 = x1.contiguous() 2025-05-07T20:32:44.2337354Z 2025-05-07T20:32:44.2337559Z if scale_ub is not None: 2025-05-07T20:32:44.2337843Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2338181Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2338501Z ) 2025-05-07T20:32:44.2338699Z else: 2025-05-07T20:32:44.2338915Z scale_ub_tensor = None 2025-05-07T20:32:44.2339182Z 2025-05-07T20:32:44.2339419Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2339744Z op = silu_mul_quant 2025-05-07T20:32:44.2340000Z if compiled: 2025-05-07T20:32:44.2340259Z op = torch.compile(op) 2025-05-07T20:32:44.2340562Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2340920Z 2025-05-07T20:32:44.2341116Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2341283Z 2025-05-07T20:32:44.2341389Z moe/activation_test.py:117: 2025-05-07T20:32:44.2341778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2342122Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2342405Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2343090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2343783Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2344321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2345006Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2345664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2346272Z kernel = self.compile( 2025-05-07T20:32:44.2346817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2347540Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2347941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2348175Z 2025-05-07T20:32:44.2348382Z self = 2025-05-07T20:32:44.2349540Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2350919Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556e480>} 2025-05-07T20:32:44.2352275Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2353309Z context = 2025-05-07T20:32:44.2353602Z 2025-05-07T20:32:44.2353768Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2354291Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2354756Z module_map=module_map) 2025-05-07T20:32:44.2355121Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2355479Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2355735Z E ^ 2025-05-07T20:32:44.2356202Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2356670Z 2025-05-07T20:32:44.2357092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2357607Z 2025-05-07T20:32:44.2357725Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2358132Z self=, 2025-05-07T20:32:44.2358541Z T=2048, 2025-05-07T20:32:44.2358736Z D=7168, 2025-05-07T20:32:44.2358931Z scale_ub=1200.0, 2025-05-07T20:32:44.2359163Z contiguous=True, 2025-05-07T20:32:44.2359389Z compiled=False, 2025-05-07T20:32:44.2359600Z ) 2025-05-07T20:32:44.3021275Z self = 2025-05-07T20:32:44.3022043Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3022434Z 2025-05-07T20:32:44.3022546Z @given( 2025-05-07T20:32:44.3022868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3023278Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3023773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3024106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3024544Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3024833Z ) 2025-05-07T20:32:44.3025184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3025629Z def test_silu_mul_quant( 2025-05-07T20:32:44.3025870Z self, 2025-05-07T20:32:44.3026073Z T: int, 2025-05-07T20:32:44.3026274Z D: int, 2025-05-07T20:32:44.3026517Z scale_ub: Optional[float], 2025-05-07T20:32:44.3026824Z contiguous: bool, 2025-05-07T20:32:44.3027067Z compiled: bool, 2025-05-07T20:32:44.3027299Z ) -> None: 2025-05-07T20:32:44.3027519Z torch.manual_seed(2025) 2025-05-07T20:32:44.3027768Z 2025-05-07T20:32:44.3028050Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3030578Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
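Because the same CompilationError fires in the compiled=False example just above, the failure lives in the Triton kernel launched from activation.py:80, not in anything torch.compile adds. A minimal standalone repro, assuming silu_mul_quant is importable from the module path shown in the traceback; on a GPU below SM 8.9 this raises the same fp8e4nv error:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Small inputs suffice: the error is raised at kernel compile time,
    # before any data-dependent work happens.
    x0 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # CompilationError on SM 8.6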
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3032459Z 2025-05-07T20:32:44.3032581Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3032802Z 2025-05-07T20:32:44.3032912Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3033327Z self=, 2025-05-07T20:32:44.3033730Z T=1, 2025-05-07T20:32:44.3033916Z D=5120, 2025-05-07T20:32:44.3034115Z scale_ub=1200.0, 2025-05-07T20:32:44.3034343Z contiguous=True, 2025-05-07T20:32:44.3034560Z compiled=False, 2025-05-07T20:32:44.3034770Z ) 2025-05-07T20:32:44.3035100Z self = 2025-05-07T20:32:44.3035577Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3035847Z 2025-05-07T20:32:44.3035928Z @given( 2025-05-07T20:32:44.3036160Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3036473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3036785Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3037118Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3037447Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3037730Z ) 2025-05-07T20:32:44.3038089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3038532Z def test_silu_mul_quant( 2025-05-07T20:32:44.3038776Z self, 2025-05-07T20:32:44.3038977Z T: int, 2025-05-07T20:32:44.3039176Z D: int, 2025-05-07T20:32:44.3039396Z scale_ub: Optional[float], 2025-05-07T20:32:44.3039671Z contiguous: bool, 2025-05-07T20:32:44.3039926Z compiled: bool, 2025-05-07T20:32:44.3040145Z ) -> None: 2025-05-07T20:32:44.3040369Z torch.manual_seed(2025) 2025-05-07T20:32:44.3040615Z 2025-05-07T20:32:44.3040885Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3041230Z 2025-05-07T20:32:44.3041433Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3041728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3042045Z x = x_sign * x_clamp 2025-05-07T20:32:44.3042293Z x0 = x[:, :D] 2025-05-07T20:32:44.3042522Z x1 = x[:, D:] 2025-05-07T20:32:44.3042730Z 2025-05-07T20:32:44.3042927Z if contiguous: 2025-05-07T20:32:44.3043166Z x0 = x0.contiguous() 2025-05-07T20:32:44.3043494Z x1 = x1.contiguous() 2025-05-07T20:32:44.3043737Z 2025-05-07T20:32:44.3043936Z if scale_ub is not None: 2025-05-07T20:32:44.3044207Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3044604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3044921Z ) 2025-05-07T20:32:44.3045117Z else: 2025-05-07T20:32:44.3045340Z scale_ub_tensor = None 2025-05-07T20:32:44.3045594Z 2025-05-07T20:32:44.3045826Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3046144Z op = silu_mul_quant 2025-05-07T20:32:44.3046402Z if compiled: 2025-05-07T20:32:44.3046678Z op = torch.compile(op) 2025-05-07T20:32:44.3047020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3047296Z 2025-05-07T20:32:44.3047500Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3047732Z 2025-05-07T20:32:44.3047839Z moe/activation_test.py:117: 2025-05-07T20:32:44.3048133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3048467Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3048791Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3049475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3050164Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3050697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3051376Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3052032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3052576Z kernel = self.compile( 2025-05-07T20:32:44.3053129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3053789Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3054195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3054430Z 2025-05-07T20:32:44.3054636Z self = 2025-05-07T20:32:44.3055713Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3057123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556fa60>} 2025-05-07T20:32:44.3058464Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3059496Z context = 2025-05-07T20:32:44.3059783Z 2025-05-07T20:32:44.3059958Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3060472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3060932Z module_map=module_map) 2025-05-07T20:32:44.3061303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3061657Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3061921Z E ^ 2025-05-07T20:32:44.3062389Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3062843Z 2025-05-07T20:32:44.3063270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3063840Z 2025-05-07T20:32:44.3063957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3064373Z self=, 2025-05-07T20:32:44.3064822Z T=2048, 2025-05-07T20:32:44.3065023Z D=5120, 2025-05-07T20:32:44.3065215Z scale_ub=None, 2025-05-07T20:32:44.3065444Z contiguous=True, 2025-05-07T20:32:44.3065678Z compiled=False, 2025-05-07T20:32:44.3065885Z ) 2025-05-07T20:32:44.3066207Z self = 2025-05-07T20:32:44.3066703Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3066973Z 2025-05-07T20:32:44.3067064Z @given( 2025-05-07T20:32:44.3067299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3067616Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3067927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3068305Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3068635Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3068922Z ) 2025-05-07T20:32:44.3069373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3069814Z def test_silu_mul_quant( 2025-05-07T20:32:44.3070063Z self, 2025-05-07T20:32:44.3070256Z T: int, 2025-05-07T20:32:44.3070461Z D: int, 2025-05-07T20:32:44.3070682Z scale_ub: Optional[float], 2025-05-07T20:32:44.3070949Z contiguous: bool, 2025-05-07T20:32:44.3071193Z compiled: bool, 2025-05-07T20:32:44.3071418Z ) -> None: 2025-05-07T20:32:44.3071641Z torch.manual_seed(2025) 2025-05-07T20:32:44.3071881Z 2025-05-07T20:32:44.3072160Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3072505Z 2025-05-07T20:32:44.3072698Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.3074685Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3076565Z 2025-05-07T20:32:44.3076689Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.3076902Z 2025-05-07T20:32:44.3077014Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3077430Z self=, 2025-05-07T20:32:44.3077836Z T=16384, 2025-05-07T20:32:44.3078034Z D=5120, 2025-05-07T20:32:44.3078233Z scale_ub=None, 2025-05-07T20:32:44.3078447Z contiguous=True, 2025-05-07T20:32:44.3078680Z compiled=False, 2025-05-07T20:32:44.3078888Z ) 2025-05-07T20:32:44.3784125Z self = 2025-05-07T20:32:44.3785625Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3786375Z 2025-05-07T20:32:44.3786594Z @given( 2025-05-07T20:32:44.3787127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3787491Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3787812Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3788145Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3788484Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3788765Z ) 2025-05-07T20:32:44.3789201Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3789651Z def test_silu_mul_quant( 2025-05-07T20:32:44.3790099Z self, 2025-05-07T20:32:44.3790299Z T: int, 2025-05-07T20:32:44.3790505Z D: int, 2025-05-07T20:32:44.3790731Z scale_ub: Optional[float], 2025-05-07T20:32:44.3791003Z contiguous: bool, 2025-05-07T20:32:44.3791322Z compiled: bool, 2025-05-07T20:32:44.3791556Z ) -> None: 2025-05-07T20:32:44.3791780Z torch.manual_seed(2025) 2025-05-07T20:32:44.3792025Z 2025-05-07T20:32:44.3792301Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3794323Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3796242Z 2025-05-07T20:32:44.3796428Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3796645Z 2025-05-07T20:32:44.3796752Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3797163Z self=, 2025-05-07T20:32:44.3797571Z T=4096, 2025-05-07T20:32:44.3797764Z D=5120, 2025-05-07T20:32:44.3797966Z scale_ub=None, 2025-05-07T20:32:44.3798190Z contiguous=True, 2025-05-07T20:32:44.3798412Z compiled=False, 2025-05-07T20:32:44.3798620Z ) 2025-05-07T20:32:44.3798943Z self = 2025-05-07T20:32:44.3799425Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3799696Z 2025-05-07T20:32:44.3799775Z @given( 2025-05-07T20:32:44.3800012Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3800337Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3800638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3800970Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3801302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3801581Z ) 2025-05-07T20:32:44.3801931Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3802373Z def test_silu_mul_quant( 2025-05-07T20:32:44.3802614Z self, 2025-05-07T20:32:44.3802820Z T: int, 2025-05-07T20:32:44.3803022Z D: int, 2025-05-07T20:32:44.3803243Z scale_ub: Optional[float], 2025-05-07T20:32:44.3803521Z contiguous: bool, 2025-05-07T20:32:44.3803766Z compiled: bool, 2025-05-07T20:32:44.3803988Z ) -> None: 2025-05-07T20:32:44.3804210Z torch.manual_seed(2025) 2025-05-07T20:32:44.3804472Z 2025-05-07T20:32:44.3804751Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3806789Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3808667Z 2025-05-07T20:32:44.3808792Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3809016Z 2025-05-07T20:32:44.3809124Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3809550Z self=, 2025-05-07T20:32:44.3809957Z T=2048, 2025-05-07T20:32:44.3810222Z D=5120, 2025-05-07T20:32:44.3810431Z scale_ub=None, 2025-05-07T20:32:44.3810652Z contiguous=False, 2025-05-07T20:32:44.3810890Z compiled=False, 2025-05-07T20:32:44.3811111Z ) 2025-05-07T20:32:44.3811483Z self = 2025-05-07T20:32:44.3811973Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3819486Z 2025-05-07T20:32:44.3819593Z @given( 2025-05-07T20:32:44.3819850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3820180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3820494Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3820836Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3821172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3821457Z ) 2025-05-07T20:32:44.3821819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3822358Z def test_silu_mul_quant( 2025-05-07T20:32:44.3822602Z self, 2025-05-07T20:32:44.3822809Z T: int, 2025-05-07T20:32:44.3823020Z D: int, 2025-05-07T20:32:44.3823286Z scale_ub: Optional[float], 2025-05-07T20:32:44.3823570Z contiguous: bool, 2025-05-07T20:32:44.3823821Z compiled: bool, 2025-05-07T20:32:44.3824048Z ) -> None: 2025-05-07T20:32:44.3824278Z torch.manual_seed(2025) 2025-05-07T20:32:44.3824527Z 2025-05-07T20:32:44.3824809Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3826875Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3829070Z 2025-05-07T20:32:44.3829199Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3829428Z 2025-05-07T20:32:44.3829534Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3829960Z self=, 2025-05-07T20:32:44.3830363Z T=4096, 2025-05-07T20:32:44.3830567Z D=7168, 2025-05-07T20:32:44.3830771Z scale_ub=None, 2025-05-07T20:32:44.3830993Z contiguous=True, 2025-05-07T20:32:44.3831224Z compiled=True, 2025-05-07T20:32:44.3831432Z ) 2025-05-07T20:32:44.3831758Z self = 2025-05-07T20:32:44.3832244Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.3832522Z 2025-05-07T20:32:44.3832611Z @given( 2025-05-07T20:32:44.3832858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3833168Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3833492Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3833833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3834165Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3834461Z ) 2025-05-07T20:32:44.3834820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3835277Z def test_silu_mul_quant( 2025-05-07T20:32:44.3835526Z self, 2025-05-07T20:32:44.3835736Z T: int, 2025-05-07T20:32:44.3835946Z D: int, 2025-05-07T20:32:44.3836169Z scale_ub: Optional[float], 2025-05-07T20:32:44.3836457Z contiguous: bool, 2025-05-07T20:32:44.3836707Z compiled: bool, 2025-05-07T20:32:44.3836953Z ) -> None: 2025-05-07T20:32:44.3837206Z torch.manual_seed(2025) 2025-05-07T20:32:44.3837559Z 2025-05-07T20:32:44.3837835Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3839984Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3841870Z 2025-05-07T20:32:44.3841997Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3842220Z 2025-05-07T20:32:44.3842329Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3842748Z self=, 2025-05-07T20:32:44.3843218Z T=2048, 2025-05-07T20:32:44.3843423Z D=5120, 2025-05-07T20:32:44.3843639Z scale_ub=1200.0, 2025-05-07T20:32:44.3843867Z contiguous=False, 2025-05-07T20:32:44.3844167Z compiled=False, 2025-05-07T20:32:44.3844387Z ) 2025-05-07T20:32:44.3844706Z self = 2025-05-07T20:32:44.3845214Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3845496Z 2025-05-07T20:32:44.3845591Z @given( 2025-05-07T20:32:44.3845832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3846144Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3846462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3846803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3847137Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3847481Z ) 2025-05-07T20:32:44.3847847Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3848296Z def test_silu_mul_quant( 2025-05-07T20:32:44.3848549Z self, 2025-05-07T20:32:44.3848758Z T: int, 2025-05-07T20:32:44.3848964Z D: int, 2025-05-07T20:32:44.3849196Z scale_ub: Optional[float], 2025-05-07T20:32:44.3849474Z contiguous: bool, 2025-05-07T20:32:44.3849711Z compiled: bool, 2025-05-07T20:32:44.3849946Z ) -> None: 2025-05-07T20:32:44.3850174Z torch.manual_seed(2025) 2025-05-07T20:32:44.3850416Z 2025-05-07T20:32:44.3850698Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3852767Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3854653Z 2025-05-07T20:32:44.3854775Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3854991Z 2025-05-07T20:32:44.3855104Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3855520Z self=, 2025-05-07T20:32:44.3855946Z T=4096, 2025-05-07T20:32:44.3856147Z D=7168, 2025-05-07T20:32:44.3856345Z scale_ub=1200.0, 2025-05-07T20:32:44.3856578Z contiguous=True, 2025-05-07T20:32:44.3856813Z compiled=False, 2025-05-07T20:32:44.3857032Z ) 2025-05-07T20:32:44.4769657Z self = 2025-05-07T20:32:44.4771182Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.4772219Z 2025-05-07T20:32:44.4772384Z @given( 2025-05-07T20:32:44.4772848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4773478Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4774223Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4774881Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4775529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4776088Z ) 2025-05-07T20:32:44.4776784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4777319Z def test_silu_mul_quant( 2025-05-07T20:32:44.4777560Z self, 2025-05-07T20:32:44.4777769Z T: int, 2025-05-07T20:32:44.4777976Z D: int, 2025-05-07T20:32:44.4778205Z scale_ub: Optional[float], 2025-05-07T20:32:44.4778472Z contiguous: bool, 2025-05-07T20:32:44.4778718Z compiled: bool, 2025-05-07T20:32:44.4779117Z ) -> None: 2025-05-07T20:32:44.4779333Z torch.manual_seed(2025) 2025-05-07T20:32:44.4779576Z 2025-05-07T20:32:44.4779976Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4782016Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4783876Z 2025-05-07T20:32:44.4784002Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4784212Z 2025-05-07T20:32:44.4784315Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4784731Z self=, 2025-05-07T20:32:44.4785136Z T=16384, 2025-05-07T20:32:44.4785328Z D=7168, 2025-05-07T20:32:44.4785531Z scale_ub=None, 2025-05-07T20:32:44.4785757Z contiguous=False, 2025-05-07T20:32:44.4785980Z compiled=True, 2025-05-07T20:32:44.4786190Z ) 2025-05-07T20:32:44.4786512Z self = 2025-05-07T20:32:44.4787004Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.4787277Z 2025-05-07T20:32:44.4787356Z @given( 2025-05-07T20:32:44.4787586Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4787897Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4788197Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4788526Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4788856Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4789250Z ) 2025-05-07T20:32:44.4789596Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4790044Z def test_silu_mul_quant( 2025-05-07T20:32:44.4790289Z self, 2025-05-07T20:32:44.4790481Z T: int, 2025-05-07T20:32:44.4790678Z D: int, 2025-05-07T20:32:44.4790895Z scale_ub: Optional[float], 2025-05-07T20:32:44.4791163Z contiguous: bool, 2025-05-07T20:32:44.4791409Z compiled: bool, 2025-05-07T20:32:44.4791641Z ) -> None: 2025-05-07T20:32:44.4791850Z torch.manual_seed(2025) 2025-05-07T20:32:44.4792094Z 2025-05-07T20:32:44.4792363Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4794435Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
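The "Tried to allocate" sizes follow directly from the input shape: x is [T, 2*D] in bfloat16, i.e. T * 2D * 2 bytes. A quick check against the figures above:

    # 80.00 MiB  <- T=4096,  D=5120:  4096 * 10240 * 2 / 2**20 = 80.0
    # 112.00 MiB <- T=4096,  D=7168:  4096 * 14336 * 2 / 2**20 = 112.0
    # 448.00 MiB <- T=16384, D=7168: 16384 * 14336 * 2 / 2**20 = 448.0
    T, D, bytes_per_bf16 = 16384, 7168, 2
    print(T * (2 * D) * bytes_per_bf16 / 2**20)  # 448.0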
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4796328Z 2025-05-07T20:32:44.4796446Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4796688Z 2025-05-07T20:32:44.4796801Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4797230Z self=, 2025-05-07T20:32:44.4797636Z T=4096, 2025-05-07T20:32:44.4797829Z D=7168, 2025-05-07T20:32:44.4798024Z scale_ub=None, 2025-05-07T20:32:44.4798244Z contiguous=True, 2025-05-07T20:32:44.4798465Z compiled=False, 2025-05-07T20:32:44.4798673Z ) 2025-05-07T20:32:44.4798991Z self = 2025-05-07T20:32:44.4799524Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.4799798Z 2025-05-07T20:32:44.4799881Z @given( 2025-05-07T20:32:44.4800160Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4800475Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4800786Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4801125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4801460Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4801740Z ) 2025-05-07T20:32:44.4802099Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4802544Z def test_silu_mul_quant( 2025-05-07T20:32:44.4802791Z self, 2025-05-07T20:32:44.4802996Z T: int, 2025-05-07T20:32:44.4803205Z D: int, 2025-05-07T20:32:44.4803426Z scale_ub: Optional[float], 2025-05-07T20:32:44.4803711Z contiguous: bool, 2025-05-07T20:32:44.4803958Z compiled: bool, 2025-05-07T20:32:44.4804177Z ) -> None: 2025-05-07T20:32:44.4804401Z torch.manual_seed(2025) 2025-05-07T20:32:44.4804661Z 2025-05-07T20:32:44.4804943Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4807010Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4808866Z 2025-05-07T20:32:44.4809000Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4809218Z 2025-05-07T20:32:44.4809331Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4809752Z self=, 2025-05-07T20:32:44.4810168Z T=16384, 2025-05-07T20:32:44.4810369Z D=7168, 2025-05-07T20:32:44.4810574Z scale_ub=None, 2025-05-07T20:32:44.4810796Z contiguous=True, 2025-05-07T20:32:44.4811021Z compiled=False, 2025-05-07T20:32:44.4811240Z ) 2025-05-07T20:32:44.4811569Z self = 2025-05-07T20:32:44.4812065Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.4812358Z 2025-05-07T20:32:44.4812444Z @given( 2025-05-07T20:32:44.4812683Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4813001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4813308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4813644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4814027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4814309Z ) 2025-05-07T20:32:44.4814664Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4815157Z def test_silu_mul_quant( 2025-05-07T20:32:44.4815398Z self, 2025-05-07T20:32:44.4815602Z T: int, 2025-05-07T20:32:44.4815803Z D: int, 2025-05-07T20:32:44.4816020Z scale_ub: Optional[float], 2025-05-07T20:32:44.4816292Z contiguous: bool, 2025-05-07T20:32:44.4816541Z compiled: bool, 2025-05-07T20:32:44.4816761Z ) -> None: 2025-05-07T20:32:44.4816974Z torch.manual_seed(2025) 2025-05-07T20:32:44.4817220Z 2025-05-07T20:32:44.4817499Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4819559Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4821460Z 2025-05-07T20:32:44.4821581Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4821801Z 2025-05-07T20:32:44.4821904Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4822311Z self=, 2025-05-07T20:32:44.4822714Z T=16384, 2025-05-07T20:32:44.4822907Z D=7168, 2025-05-07T20:32:44.4823108Z scale_ub=1200.0, 2025-05-07T20:32:44.4823335Z contiguous=True, 2025-05-07T20:32:44.4823563Z compiled=False, 2025-05-07T20:32:44.4823776Z ) 2025-05-07T20:32:44.4824101Z self = 2025-05-07T20:32:44.4824592Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.4824880Z 2025-05-07T20:32:44.4824962Z @given( 2025-05-07T20:32:44.4825201Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4825511Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4825821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4826161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4826505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4826815Z ) 2025-05-07T20:32:44.4827194Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4827639Z def test_silu_mul_quant( 2025-05-07T20:32:44.4827887Z self, 2025-05-07T20:32:44.4828094Z T: int, 2025-05-07T20:32:44.4828583Z D: int, 2025-05-07T20:32:44.4828803Z scale_ub: Optional[float], 2025-05-07T20:32:44.4829118Z contiguous: bool, 2025-05-07T20:32:44.4829365Z compiled: bool, 2025-05-07T20:32:44.4829583Z ) -> None: 2025-05-07T20:32:44.4829812Z torch.manual_seed(2025) 2025-05-07T20:32:44.4830068Z 2025-05-07T20:32:44.4830334Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4832363Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
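Each example allocates x afresh, so these OOMs reflect the ~22 GiB already held by the process before the first failing draw, not the 40-448 MiB requests themselves. A hedged mitigation sketch (an assumption, not part of the test suite): trim the allocator cache between examples, e.g. from setUp()/tearDown():

    import gc

    import torch

    def release_cached_cuda_memory() -> None:
        # Drop dangling references first, then return cached blocks to the
        # driver. Live tensors are untouched; only PyTorch's cache shrinks.
        gc.collect()
        torch.cuda.empty_cache()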
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4834217Z 2025-05-07T20:32:44.4834427Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4834681Z 2025-05-07T20:32:44.4834794Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4835340Z self=, 2025-05-07T20:32:44.4835742Z T=128, 2025-05-07T20:32:44.4835937Z D=5120, 2025-05-07T20:32:44.4836137Z scale_ub=1200.0, 2025-05-07T20:32:44.4836367Z contiguous=False, 2025-05-07T20:32:44.4836593Z compiled=False, 2025-05-07T20:32:44.4836799Z ) 2025-05-07T20:32:44.5849674Z self = 2025-05-07T20:32:44.5851121Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.5851786Z 2025-05-07T20:32:44.5851950Z @given( 2025-05-07T20:32:44.5852411Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5853031Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5854020Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5854681Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5855326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5855875Z ) 2025-05-07T20:32:44.5856714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5857463Z def test_silu_mul_quant( 2025-05-07T20:32:44.5857735Z self, 2025-05-07T20:32:44.5857935Z T: int, 2025-05-07T20:32:44.5858143Z D: int, 2025-05-07T20:32:44.5858364Z scale_ub: Optional[float], 2025-05-07T20:32:44.5858632Z contiguous: bool, 2025-05-07T20:32:44.5858873Z compiled: bool, 2025-05-07T20:32:44.5859106Z ) -> None: 2025-05-07T20:32:44.5859320Z torch.manual_seed(2025) 2025-05-07T20:32:44.5859563Z 2025-05-07T20:32:44.5859840Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5860179Z 2025-05-07T20:32:44.5860384Z x_sign = torch.sign(x) 2025-05-07T20:32:44.5860683Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.5860990Z x = x_sign * x_clamp 2025-05-07T20:32:44.5861237Z x0 = x[:, :D] 2025-05-07T20:32:44.5861460Z x1 = x[:, D:] 2025-05-07T20:32:44.5861667Z 2025-05-07T20:32:44.5861858Z if contiguous: 2025-05-07T20:32:44.5862095Z x0 = x0.contiguous() 2025-05-07T20:32:44.5862348Z x1 = x1.contiguous() 2025-05-07T20:32:44.5862589Z 2025-05-07T20:32:44.5862788Z if scale_ub is not None: 2025-05-07T20:32:44.5863061Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.5863398Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.5863715Z ) 2025-05-07T20:32:44.5863914Z else: 2025-05-07T20:32:44.5864124Z scale_ub_tensor = None 2025-05-07T20:32:44.5864378Z 2025-05-07T20:32:44.5864611Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.5864923Z op = silu_mul_quant 2025-05-07T20:32:44.5865178Z if compiled: 2025-05-07T20:32:44.5865429Z op = torch.compile(op) 2025-05-07T20:32:44.5865722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.5866002Z 2025-05-07T20:32:44.5866200Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.5866366Z 2025-05-07T20:32:44.5866468Z moe/activation_test.py:117: 2025-05-07T20:32:44.5866763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.5867101Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.5867385Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.5868077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.5868770Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.5869411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.5870178Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.5870837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.5871443Z kernel = self.compile( 2025-05-07T20:32:44.5871987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.5872632Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.5873033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.5873259Z 2025-05-07T20:32:44.5873473Z self = 2025-05-07T20:32:44.5874549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.5876045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96553b6660>} 2025-05-07T20:32:44.5877441Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.5878474Z context = 2025-05-07T20:32:44.5878767Z 2025-05-07T20:32:44.5878941Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.5879453Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.5879927Z module_map=module_map) 2025-05-07T20:32:44.5880300Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.5880659Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.5880924Z E ^ 2025-05-07T20:32:44.5881395Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.5881847Z 2025-05-07T20:32:44.5882270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.5882779Z 2025-05-07T20:32:44.5882898Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5883308Z self=, 2025-05-07T20:32:44.5883713Z T=2048, 2025-05-07T20:32:44.5883907Z D=7168, 2025-05-07T20:32:44.5884099Z scale_ub=None, 2025-05-07T20:32:44.5884324Z contiguous=False, 2025-05-07T20:32:44.5884562Z compiled=False, 2025-05-07T20:32:44.5884770Z ) 2025-05-07T20:32:44.5885097Z self = 2025-05-07T20:32:44.5885596Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.5885869Z 2025-05-07T20:32:44.5885953Z @given( 2025-05-07T20:32:44.5886193Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5886521Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5886837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5887191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5887540Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5887831Z ) 2025-05-07T20:32:44.5888172Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5888611Z def test_silu_mul_quant( 2025-05-07T20:32:44.5888858Z self, 2025-05-07T20:32:44.5889048Z T: int, 2025-05-07T20:32:44.5889255Z D: int, 2025-05-07T20:32:44.5889480Z scale_ub: Optional[float], 2025-05-07T20:32:44.5889748Z contiguous: bool, 2025-05-07T20:32:44.5889992Z compiled: bool, 2025-05-07T20:32:44.5890272Z ) -> None: 2025-05-07T20:32:44.5890489Z torch.manual_seed(2025) 2025-05-07T20:32:44.5890734Z 2025-05-07T20:32:44.5891014Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5893133Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
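The CompilationError interleaved with the OOMs above is a different failure: Triton declines to lower fp8e4nv (FP8 E4M3) on this GPU architecture, which only exposes 'fp8e4b15' and 'fp8e5'. E4M3 support arrives with compute capability 8.9 (Ada) / 9.0 (Hopper), so one option is to gate FP8 tests on device capability; a sketch (the threshold and skip message are assumptions derived from the error text, not the suite's own gating):

    import unittest

    import torch

    def supports_fp8_e4m3() -> bool:
        if not torch.cuda.is_available():
            return False
        # Triton's fp8e4nv needs SM 8.9+; older parts raise the
        # ValueError quoted in this log.
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8_e4m3(), "FP8 E4M3 unsupported on this GPU")
    class Fp8ActivationTests(unittest.TestCase):
        ...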
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.5894970Z 2025-05-07T20:32:44.5895096Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.5895308Z 2025-05-07T20:32:44.5895456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5895875Z self=, 2025-05-07T20:32:44.5896282Z T=128, 2025-05-07T20:32:44.5896508Z D=7168, 2025-05-07T20:32:44.5896714Z scale_ub=1200.0, 2025-05-07T20:32:44.5896950Z contiguous=True, 2025-05-07T20:32:44.5897168Z compiled=True, 2025-05-07T20:32:44.5897375Z ) 2025-05-07T20:32:44.6197041Z self = 2025-05-07T20:32:44.6197834Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.6198214Z 2025-05-07T20:32:44.6198332Z @given( 2025-05-07T20:32:44.6198642Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6199052Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6199367Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6199707Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6200042Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6200338Z ) 2025-05-07T20:32:44.6200697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6201139Z def test_silu_mul_quant( 2025-05-07T20:32:44.6201395Z self, 2025-05-07T20:32:44.6201597Z T: int, 2025-05-07T20:32:44.6201798Z D: int, 2025-05-07T20:32:44.6202025Z scale_ub: Optional[float], 2025-05-07T20:32:44.6202300Z contiguous: bool, 2025-05-07T20:32:44.6210287Z compiled: bool, 2025-05-07T20:32:44.6210551Z ) -> None: 2025-05-07T20:32:44.6210778Z torch.manual_seed(2025) 2025-05-07T20:32:44.6211035Z 2025-05-07T20:32:44.6211318Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6211663Z 2025-05-07T20:32:44.6211870Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6212175Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6212499Z x = x_sign * x_clamp 2025-05-07T20:32:44.6212756Z x0 = x[:, :D] 2025-05-07T20:32:44.6212987Z x1 = x[:, D:] 2025-05-07T20:32:44.6213198Z 2025-05-07T20:32:44.6213397Z if contiguous: 2025-05-07T20:32:44.6213646Z x0 = x0.contiguous() 2025-05-07T20:32:44.6213911Z x1 = x1.contiguous() 2025-05-07T20:32:44.6214160Z 2025-05-07T20:32:44.6214362Z if scale_ub is not None: 2025-05-07T20:32:44.6214637Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.6214980Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.6215297Z ) 2025-05-07T20:32:44.6215503Z else: 2025-05-07T20:32:44.6215717Z scale_ub_tensor = None 2025-05-07T20:32:44.6215974Z 2025-05-07T20:32:44.6216215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.6216529Z op = silu_mul_quant 2025-05-07T20:32:44.6216804Z if compiled: 2025-05-07T20:32:44.6217096Z op = torch.compile(op) 2025-05-07T20:32:44.6217638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.6217919Z 2025-05-07T20:32:44.6218121Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.6218287Z 2025-05-07T20:32:44.6218485Z moe/activation_test.py:117: 2025-05-07T20:32:44.6218789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.6219136Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.6219424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.6219987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.6220566Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.6221246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.6221938Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.6222478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.6223379Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.6224253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.6224879Z kernel = self.compile( 2025-05-07T20:32:44.6225518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.6226300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.6226754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.6227033Z 2025-05-07T20:32:44.6227267Z self = 2025-05-07T20:32:44.6229183Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.6230940Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96553b7c40>} 2025-05-07T20:32:44.6232608Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.6233859Z context = 2025-05-07T20:32:44.6234207Z 2025-05-07T20:32:44.6234393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.6235007Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.6235560Z module_map=module_map) 2025-05-07T20:32:44.6235972Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.6236381Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.6236696Z E ^ 2025-05-07T20:32:44.6237266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.6237825Z 2025-05-07T20:32:44.6238330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.6238962Z 2025-05-07T20:32:44.6239076Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6239563Z self=, 2025-05-07T20:32:44.6240025Z T=128, 2025-05-07T20:32:44.6240232Z D=7168, 2025-05-07T20:32:44.6240445Z scale_ub=1200.0, 2025-05-07T20:32:44.6240685Z contiguous=True, 2025-05-07T20:32:44.6240930Z compiled=False, 2025-05-07T20:32:44.6241166Z ) 2025-05-07T20:32:44.6241522Z self = 2025-05-07T20:32:44.6242121Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.6242400Z 2025-05-07T20:32:44.6242482Z @given( 2025-05-07T20:32:44.6242789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6243104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6243418Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6243758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6244086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6244379Z ) 2025-05-07T20:32:44.6244738Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6245181Z def test_silu_mul_quant( 2025-05-07T20:32:44.6245427Z self, 2025-05-07T20:32:44.6245629Z T: int, 2025-05-07T20:32:44.6245826Z D: int, 2025-05-07T20:32:44.6246049Z scale_ub: Optional[float], 2025-05-07T20:32:44.6246442Z contiguous: bool, 2025-05-07T20:32:44.6246692Z compiled: bool, 2025-05-07T20:32:44.6246913Z ) -> None: 2025-05-07T20:32:44.6247170Z torch.manual_seed(2025) 2025-05-07T20:32:44.6247528Z 2025-05-07T20:32:44.6247804Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6248156Z 2025-05-07T20:32:44.6248355Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6248639Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6250665Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.6252545Z 2025-05-07T20:32:44.6252667Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.6252896Z 2025-05-07T20:32:44.6253004Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6253422Z self=, 2025-05-07T20:32:44.6253828Z T=128, 2025-05-07T20:32:44.6254023Z D=5120, 2025-05-07T20:32:44.6254223Z scale_ub=1200.0, 2025-05-07T20:32:44.6254446Z contiguous=True, 2025-05-07T20:32:44.6254676Z compiled=True, 2025-05-07T20:32:44.6254885Z ) 2025-05-07T20:32:44.6255208Z self = 2025-05-07T20:32:44.6255705Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.6255981Z 2025-05-07T20:32:44.6256064Z @given( 2025-05-07T20:32:44.6256298Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6256615Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6256926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6257264Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6257592Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6257883Z ) 2025-05-07T20:32:44.6258236Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6258683Z def test_silu_mul_quant( 2025-05-07T20:32:44.6258927Z self, 2025-05-07T20:32:44.6259125Z T: int, 2025-05-07T20:32:44.6259333Z D: int, 2025-05-07T20:32:44.6259552Z scale_ub: Optional[float], 2025-05-07T20:32:44.6259825Z contiguous: bool, 2025-05-07T20:32:44.6260066Z compiled: bool, 2025-05-07T20:32:44.6260286Z ) -> None: 2025-05-07T20:32:44.6260503Z torch.manual_seed(2025) 2025-05-07T20:32:44.6260799Z 2025-05-07T20:32:44.6261167Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6261748Z 2025-05-07T20:32:44.6262015Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6262397Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6264504Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.6266381Z 2025-05-07T20:32:44.6266501Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.6266745Z 2025-05-07T20:32:44.6266863Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6267335Z self=, 2025-05-07T20:32:44.6267735Z T=128, 2025-05-07T20:32:44.6267932Z D=7168, 2025-05-07T20:32:44.6268171Z scale_ub=None, 2025-05-07T20:32:44.6268386Z contiguous=True, 2025-05-07T20:32:44.6268612Z compiled=True, 2025-05-07T20:32:44.6268816Z ) 2025-05-07T20:32:44.8199264Z self = 2025-05-07T20:32:44.8199984Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.8200342Z 2025-05-07T20:32:44.8200427Z @given( 2025-05-07T20:32:44.8200666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8200979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8201289Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8201620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8201950Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8202258Z ) 2025-05-07T20:32:44.8202610Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8203053Z def test_silu_mul_quant( 2025-05-07T20:32:44.8203304Z self, 2025-05-07T20:32:44.8203511Z T: int, 2025-05-07T20:32:44.8203713Z D: int, 2025-05-07T20:32:44.8203929Z scale_ub: Optional[float], 2025-05-07T20:32:44.8204204Z contiguous: bool, 2025-05-07T20:32:44.8204450Z compiled: bool, 2025-05-07T20:32:44.8204679Z ) -> None: 2025-05-07T20:32:44.8204903Z torch.manual_seed(2025) 2025-05-07T20:32:44.8205148Z 2025-05-07T20:32:44.8205419Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8207468Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
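The figures in these messages (total capacity, free, allocated by PyTorch, reserved but unallocated) can be read back programmatically when triaging; a small sketch:

    import torch

    MIB = 2**20
    free_b, total_b = torch.cuda.mem_get_info()  # driver-level free/total
    allocated_b = torch.cuda.memory_allocated()  # live PyTorch tensors
    reserved_b = torch.cuda.memory_reserved()    # held by the caching allocator
    print(f"free {free_b / MIB:.2f} MiB of {total_b / MIB:.2f} MiB total")
    # reserved - allocated ~ "reserved by PyTorch but unallocated" above
    print(f"allocated {allocated_b / MIB:.2f} MiB, reserved {reserved_b / MIB:.2f} MiB")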
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8209336Z 2025-05-07T20:32:44.8209456Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.8209674Z 2025-05-07T20:32:44.8219002Z FAILED 2025-05-07T20:32:44.8219173Z 2025-05-07T20:32:44.8219629Z =================================== FAILURES =================================== 2025-05-07T20:32:44.8220233Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:44.8220854Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:44.8221708Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:44.8222478Z | yield 2025-05-07T20:32:44.8223261Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:44.8223982Z | self._callTestMethod(testMethod) 2025-05-07T20:32:44.8224863Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:44.8225617Z | if method() is not None: 2025-05-07T20:32:44.8225951Z | ^^^^^^^^ 2025-05-07T20:32:44.8227010Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:44.8228025Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8228674Z | ^^^^^^^ 2025-05-07T20:32:44.8229538Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:44.8230423Z | raise the_error_hypothesis_found 2025-05-07T20:32:44.8231137Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:44.8231718Z +-+---------------- 1 ---------------- 2025-05-07T20:32:44.8232220Z | Traceback (most recent call last): 2025-05-07T20:32:44.8233220Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.8234291Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8234814Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8237647Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8240437Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8241051Z | self=, 2025-05-07T20:32:44.8241609Z | T=2048, 2025-05-07T20:32:44.8241935Z | D=5120, # or any other generated value 2025-05-07T20:32:44.8242404Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:44.8242892Z | contiguous=True, # or any other generated value 2025-05-07T20:32:44.8243402Z | compiled=False, # or any other generated value 2025-05-07T20:32:44.8243832Z | ) 2025-05-07T20:32:44.8244092Z | 2025-05-07T20:32:44.8244828Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:44.8245688Z +---------------- 2 ---------------- 2025-05-07T20:32:44.8246090Z | Traceback (most recent call last): 2025-05-07T20:32:44.8247114Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.8248215Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8248733Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8251506Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8254379Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8255062Z | self=, 2025-05-07T20:32:44.8255618Z | T=128, 2025-05-07T20:32:44.8255891Z | D=7168, 2025-05-07T20:32:44.8256171Z | scale_ub=None, 2025-05-07T20:32:44.8256503Z | contiguous=True, 2025-05-07T20:32:44.8256832Z | compiled=True, 2025-05-07T20:32:44.8257137Z | ) 2025-05-07T20:32:44.8257380Z | 2025-05-07T20:32:44.8258107Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.8258818Z +---------------- 3 ---------------- 2025-05-07T20:32:44.8259134Z | Traceback (most recent call last): 2025-05-07T20:32:44.8259986Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.8260860Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8261253Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8263213Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
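The replay recipe Hypothesis prints is meant to be applied verbatim. A placement sketch using the blob from sub-exception 2 above (version string and payload copied from this log; the trivial body is a stand-in for the real test, and on a build where the failure no longer occurs Hypothesis raises DidNotReproduce):

    from typing import Optional

    from hypothesis import given, reproduce_failure, settings, strategies as st

    # Pins Hypothesis to the exact failing draw:
    # T=128, D=7168, scale_ub=None, contiguous=True, compiled=True.
    # Remove the decorator once the underlying bug is fixed.
    @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_replay(
        T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
    ) -> None:
        ...  # the real test body goes here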
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8265167Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8265611Z | self=, 2025-05-07T20:32:44.8266024Z | T=128, 2025-05-07T20:32:44.8266223Z | D=5120, 2025-05-07T20:32:44.8266442Z | scale_ub=1200.0, 2025-05-07T20:32:44.8266688Z | contiguous=True, 2025-05-07T20:32:44.8266951Z | compiled=True, 2025-05-07T20:32:44.8267296Z | ) 2025-05-07T20:32:44.8267544Z | 2025-05-07T20:32:44.8268262Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.8269205Z +---------------- 4 ---------------- 2025-05-07T20:32:44.8269609Z | Traceback (most recent call last): 2025-05-07T20:32:44.8270611Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:44.8271615Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.8272029Z | ^^^^^^^^ 2025-05-07T20:32:44.8272946Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:44.8273932Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8274407Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8275535Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:44.8276664Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.8277517Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:44.8278531Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8279151Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8280143Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:44.8281280Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8281958Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8282895Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:44.8284022Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8284672Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8285571Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:44.8286641Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.8287286Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8288121Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:44.8288925Z | fn() 2025-05-07T20:32:44.8289736Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:44.8290627Z | self.fn.run( 2025-05-07T20:32:44.8291375Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:44.8292208Z | kernel = self.compile( 2025-05-07T20:32:44.8292583Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:44.8293420Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:44.8294422Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8294972Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8295871Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.8297029Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8297700Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8298235Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8298721Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.8299092Z | ^ 2025-05-07T20:32:44.8299744Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8300550Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8301106Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:44.8301821Z | self=, 2025-05-07T20:32:44.8302423Z | T=1, # or any other generated value 2025-05-07T20:32:44.8302849Z | D=5120, # or any other generated value 2025-05-07T20:32:44.8303337Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:44.8303847Z | contiguous=True, # or any other generated value 2025-05-07T20:32:44.8304359Z | compiled=True, # or any other generated value 2025-05-07T20:32:44.8304779Z | ) 2025-05-07T20:32:44.8305039Z | 2025-05-07T20:32:44.8305780Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.8306640Z +------------------------------------ 2025-05-07T20:32:44.8307245Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:44.8307775Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8308399Z self=, 2025-05-07T20:32:44.8308970Z T=1, 2025-05-07T20:32:44.8309354Z D=5120, 2025-05-07T20:32:44.8309620Z scale_ub=None, 2025-05-07T20:32:44.8309921Z contiguous=True, 2025-05-07T20:32:44.8310234Z compiled=True, 2025-05-07T20:32:44.8310518Z ) 2025-05-07T20:32:44.8310950Z self = 2025-05-07T20:32:44.8311613Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.8311974Z 2025-05-07T20:32:44.8312087Z @given( 2025-05-07T20:32:44.8312394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8312812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8313276Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8313725Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8314180Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8314637Z ) 2025-05-07T20:32:44.8315116Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8315734Z def test_silu_mul_quant( 2025-05-07T20:32:44.8316078Z self, 2025-05-07T20:32:44.8316355Z T: int, 2025-05-07T20:32:44.8316631Z D: int, 2025-05-07T20:32:44.8316961Z scale_ub: Optional[float], 2025-05-07T20:32:44.8317373Z contiguous: 
bool, 2025-05-07T20:32:44.8317693Z compiled: bool, 2025-05-07T20:32:44.8317994Z ) -> None: 2025-05-07T20:32:44.8318277Z torch.manual_seed(2025) 2025-05-07T20:32:44.8318592Z 2025-05-07T20:32:44.8318963Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8319436Z 2025-05-07T20:32:44.8319714Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8320134Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8320575Z x = x_sign * x_clamp 2025-05-07T20:32:44.8320910Z x0 = x[:, :D] 2025-05-07T20:32:44.8340781Z x1 = x[:, D:] 2025-05-07T20:32:44.8341115Z 2025-05-07T20:32:44.8341359Z if contiguous: 2025-05-07T20:32:44.8341660Z x0 = x0.contiguous() 2025-05-07T20:32:44.8341992Z x1 = x1.contiguous() 2025-05-07T20:32:44.8342302Z 2025-05-07T20:32:44.8342551Z if scale_ub is not None: 2025-05-07T20:32:44.8342900Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8343349Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8343768Z ) 2025-05-07T20:32:44.8344027Z else: 2025-05-07T20:32:44.8344294Z scale_ub_tensor = None 2025-05-07T20:32:44.8344634Z 2025-05-07T20:32:44.8344935Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8345352Z op = silu_mul_quant 2025-05-07T20:32:44.8345680Z if compiled: 2025-05-07T20:32:44.8346007Z op = torch.compile(op) 2025-05-07T20:32:44.8346391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8346762Z 2025-05-07T20:32:44.8347058Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.8347450Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.8347847Z 2025-05-07T20:32:44.8348188Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8348642Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.8349152Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.8349595Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.8350089Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8350516Z 2025-05-07T20:32:44.8350801Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.8351084Z 2025-05-07T20:32:44.8351235Z moe/activation_test.py:126: 2025-05-07T20:32:44.8351832Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8352297Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.8352842Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8353942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.8354992Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.8355744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8356729Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8357697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.8358717Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8359868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:44.8361009Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8362007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.8362895Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.8363733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.8364463Z fn() 2025-05-07T20:32:44.8365161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.8365952Z self.fn.run( 2025-05-07T20:32:44.8366574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8367285Z kernel = self.compile( 2025-05-07T20:32:44.8368004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8368876Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8369409Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8369716Z 2025-05-07T20:32:44.8369988Z self = 2025-05-07T20:32:44.8371442Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8373307Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9831ba1260>} 2025-05-07T20:32:44.8375167Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8376649Z context = 2025-05-07T20:32:44.8377069Z 2025-05-07T20:32:44.8377290Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8378010Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8378678Z module_map=module_map) 2025-05-07T20:32:44.8379163Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8379639Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.8379997Z E ^ 2025-05-07T20:32:44.8380615Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8381285Z 2025-05-07T20:32:44.8381840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8382542Z 2025-05-07T20:32:44.8382725Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8383278Z self=, 2025-05-07T20:32:44.8383807Z T=2048, 2025-05-07T20:32:44.8384060Z D=5120, 2025-05-07T20:32:44.8384320Z scale_ub=1200.0, 2025-05-07T20:32:44.8384604Z contiguous=True, 2025-05-07T20:32:44.8384903Z compiled=False, 2025-05-07T20:32:44.8385179Z ) 2025-05-07T20:32:44.8385598Z self = 2025-05-07T20:32:44.8386252Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.8386627Z 2025-05-07T20:32:44.8386750Z @given( 2025-05-07T20:32:44.8387091Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8387556Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8387963Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8388472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8388906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8389394Z ) 2025-05-07T20:32:44.8389861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8390448Z def test_silu_mul_quant( 2025-05-07T20:32:44.8390761Z self, 2025-05-07T20:32:44.8391021Z T: int, 2025-05-07T20:32:44.8391287Z D: int, 2025-05-07T20:32:44.8391570Z scale_ub: Optional[float], 2025-05-07T20:32:44.8391929Z contiguous: bool, 2025-05-07T20:32:44.8392245Z compiled: bool, 2025-05-07T20:32:44.8392535Z ) -> None: 2025-05-07T20:32:44.8392824Z torch.manual_seed(2025) 2025-05-07T20:32:44.8393147Z 2025-05-07T20:32:44.8393505Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8393964Z 2025-05-07T20:32:44.8394225Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8394603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8395021Z x = x_sign * x_clamp 2025-05-07T20:32:44.8395343Z x0 = x[:, :D] 2025-05-07T20:32:44.8395625Z x1 = x[:, D:] 2025-05-07T20:32:44.8395901Z 2025-05-07T20:32:44.8396165Z if contiguous: 2025-05-07T20:32:44.8396499Z x0 = x0.contiguous() 2025-05-07T20:32:44.8396865Z x1 = x1.contiguous() 2025-05-07T20:32:44.8397202Z 2025-05-07T20:32:44.8397463Z if scale_ub is not None: 2025-05-07T20:32:44.8397843Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8398309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8398727Z ) 2025-05-07T20:32:44.8398995Z else: 2025-05-07T20:32:44.8399297Z scale_ub_tensor = None 2025-05-07T20:32:44.8399656Z 2025-05-07T20:32:44.8399958Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8400383Z op = silu_mul_quant 2025-05-07T20:32:44.8400726Z if compiled: 2025-05-07T20:32:44.8401053Z op = torch.compile(op) 2025-05-07T20:32:44.8401453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8401829Z 2025-05-07T20:32:44.8402079Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8402305Z 2025-05-07T20:32:44.8402436Z moe/activation_test.py:117: 2025-05-07T20:32:44.8402854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8403328Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8403714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8404648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8405601Z 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
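fp8e4nv is Triton's name for the NVIDIA E4M3 float8 format (torch.float8_e4m3fn), and Triton only generates conversions for it on GPUs with compute capability 8.9 or newer (Ada/Hopper); on older CUDA devices only fp8e4b15 and fp8e5 are available, which is exactly the ValueError above. A minimal sketch of that gate, assuming a CUDA-enabled PyTorch build:

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv == torch.float8_e4m3fn. Triton emits conversions for it only on
    # compute capability >= (8, 9); Ampere parts report (8, 0) or (8, 6) and
    # hit the ValueError seen in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)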
Trying example: test_silu_mul_quant(
    T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True,
)
    (test source identical to the listing above)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
Each remaining example repeats the identical test source and one of the two traceback shapes shown above, differing only in the drawn parameters and in which kernel fails to compile:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()   (moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()   (moe/activation_test.py:126, compiling _kernel_quantize_fp8_row)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()   (moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()   (moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
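The repeated "Trying example:" dumps come from @settings(verbosity=Verbosity.verbose, ...) on the test: verbose Hypothesis echoes every drawn example before executing it, so a single root cause fans out into one dump per (T, D, scale_ub, contiguous, compiled) draw. A minimal sketch of that pattern; max_examples=16 here is an assumption standing in for the module's _MAX_SAMPLES constant:

from hypothesis import Verbosity, given, settings, strategies as st

@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
def test_demo(T: int) -> None:
    # Verbose mode prints "Trying example: test_demo(T=...)" before each run,
    # which is what produces the repeated blocks in this log.
    assert T in (1, 128, 2048, 4096, 16384)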
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()   (moe/activation_test.py:126, compiling _kernel_quantize_fp8_row)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()   (moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()   (moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
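For reference, the computation both paths implement: fn() fuses y = x0 * sigmoid(x0) * x1 with rowwise FP8 quantization in one Triton kernel, while ref_fn() computes the product in fp32 and quantizes it unfused through triton_quantize_fp8_row. A plain-PyTorch sketch of the rowwise recipe, assuming E4M3 as the target dtype; the eps floor and where the scale_ub cap applies are assumptions, not FBGEMM's exact kernel:

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3 ("fp8e4nv")

def quantize_fp8_row_ref(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row dequantization scale: row_max / FP8_MAX, optionally capped by
    # scale_ub and floored by a small eps (both details assumed here).
    row_max = x.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = (row_max / FP8_MAX).clamp(min=1e-12)
    xq = (x / scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    # Dequantize as in the test: xq.to(torch.float32) * scale[:, None]
    return xq, scale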
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f9807db2fc0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
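Every example here fails at Triton compile time rather than in the test logic: both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant cast to fp8e4nv (Triton's name for FP8 E4M3), and Triton only implements that dtype on GPUs with compute capability 8.9 or newer; older architectures expose only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability guard that would skip these cases on unsupported hardware follows; the helper and class names are illustrative, not part of the FBGEMM test suite:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True if the current CUDA device can compile Triton fp8e4nv (E4M3) kernels."""
    if not torch.cuda.is_available():
        return False
    # Triton's fp8e4nv requires SM 8.9+ (Ada/Hopper); earlier parts only
    # expose fp8e4b15 and fp8e5, which is what the ValueError above lists.
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs compute capability >= 8.9")
class SiluMulQuantGuardedTests(unittest.TestCase):
    ...

With Hypothesis in the mix, the same predicate could instead feed hypothesis.assume() inside the test body, so unsupported parameter draws are discarded rather than reported as failures.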
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
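For reference, the quantity the failing ref_fn is trying to compute is a rowwise FP8 quantization: each row is scaled so its absolute max maps into the E4M3 range, and the per-row dequantization scale is returned alongside the FP8 tensor (the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]). A plain-PyTorch sketch of that contract, assuming E4M3's finite max of 448 and an eps clamp chosen here for illustration; this is not FBGEMM's exact triton_quantize_fp8_row implementation:

import torch

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def quantize_fp8_row_ref(y, scale_ub=None):
    # Per-row dequantization scale: row_max / fp8_max, optionally capped by
    # scale_ub, mirroring how the test dequantizes with y_scale[:, None].
    row_max = y.abs().amax(dim=-1).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (
        (y.float() / scale[:, None])
        .clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)  # keep rounding inside the E4M3 range
        .to(torch.float8_e4m3fn)
    )
    return y_fp8, scale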
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> fn -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> fn -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
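The fused op under test composes the same quantization with a SiLU-gated product: silu(x0) * x1 computed in fp32, then quantized rowwise, which is exactly what the test's ref_fn builds out of torch.sigmoid and triton_quantize_fp8_row. A hedged eager-mode equivalent, reusing the quantize_fp8_row_ref sketch above rather than FBGEMM's fused Triton kernel:

import torch


def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # silu(x) = x * sigmoid(x); gate x1 by silu(x0) in fp32, then apply the
    # rowwise FP8 quantization sketched earlier.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    return quantize_fp8_row_ref(y, scale_ub)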
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8789789Z 2025-05-07T20:32:44.8790207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8790260Z 2025-05-07T20:32:44.8790368Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8790638Z self=, 2025-05-07T20:32:44.8790719Z T=128, 2025-05-07T20:32:44.8790795Z D=5120, 2025-05-07T20:32:44.8790885Z scale_ub=None, 2025-05-07T20:32:44.8790975Z contiguous=False, 2025-05-07T20:32:44.8791062Z compiled=True, 2025-05-07T20:32:44.8791135Z ) 2025-05-07T20:32:44.8791354Z self = 2025-05-07T20:32:44.8791528Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.8791532Z 2025-05-07T20:32:44.8791608Z @given( 2025-05-07T20:32:44.8791727Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8791830Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8791990Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8792108Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8792293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8792368Z ) 2025-05-07T20:32:44.8792624Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8792716Z def test_silu_mul_quant( 2025-05-07T20:32:44.8792791Z self, 2025-05-07T20:32:44.8792873Z T: int, 2025-05-07T20:32:44.8792948Z D: int, 2025-05-07T20:32:44.8793045Z scale_ub: Optional[float], 2025-05-07T20:32:44.8793140Z contiguous: bool, 2025-05-07T20:32:44.8793224Z compiled: bool, 2025-05-07T20:32:44.8793301Z ) -> None: 2025-05-07T20:32:44.8793401Z torch.manual_seed(2025) 2025-05-07T20:32:44.8793473Z 2025-05-07T20:32:44.8793640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8793722Z 2025-05-07T20:32:44.8793813Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8793943Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8794031Z x = x_sign * x_clamp 2025-05-07T20:32:44.8794114Z x0 = x[:, :D] 2025-05-07T20:32:44.8794202Z x1 = x[:, D:] 2025-05-07T20:32:44.8794273Z 2025-05-07T20:32:44.8794356Z if contiguous: 2025-05-07T20:32:44.8794452Z x0 = x0.contiguous() 2025-05-07T20:32:44.8794539Z x1 = x1.contiguous() 2025-05-07T20:32:44.8794611Z 2025-05-07T20:32:44.8794707Z if scale_ub is not None: 2025-05-07T20:32:44.8794812Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8794944Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8795025Z ) 2025-05-07T20:32:44.8795100Z else: 2025-05-07T20:32:44.8795200Z scale_ub_tensor = None 2025-05-07T20:32:44.8795272Z 2025-05-07T20:32:44.8795401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8795501Z op = silu_mul_quant 2025-05-07T20:32:44.8795589Z if compiled: 2025-05-07T20:32:44.8795687Z op = torch.compile(op) 2025-05-07T20:32:44.8795800Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8795873Z 2025-05-07T20:32:44.8795964Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8795969Z 2025-05-07T20:32:44.8796073Z moe/activation_test.py:117: 2025-05-07T20:32:44.8796200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8796300Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8796403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8796803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.8796924Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.8797419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8797570Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8797935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8798196Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8798542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8798635Z kernel = self.compile( 2025-05-07T20:32:44.8799016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8799193Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8799324Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8799328Z 2025-05-07T20:32:44.8799530Z self = 2025-05-07T20:32:44.8800403Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8800905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806dcf1a0>} 2025-05-07T20:32:44.8801670Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8806329Z context = 2025-05-07T20:32:44.8806338Z 2025-05-07T20:32:44.8806521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8806786Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8806913Z module_map=module_map) 2025-05-07T20:32:44.8807094Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8807218Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8807315Z E ^ 2025-05-07T20:32:44.8807677Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8807681Z 
2025-05-07T20:32:44.8808117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8808121Z 
2025-05-07T20:32:44.8808226Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8808459Z     self=,
2025-05-07T20:32:44.8808537Z     T=128,
2025-05-07T20:32:44.8808616Z     D=7168,
2025-05-07T20:32:44.8808708Z     scale_ub=1200.0,
2025-05-07T20:32:44.8808801Z     contiguous=False,
2025-05-07T20:32:44.8808889Z     compiled=False,
2025-05-07T20:32:44.8808973Z )
2025-05-07T20:32:44.8809192Z self = 
2025-05-07T20:32:44.8809371Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:44.8809376Z 
2025-05-07T20:32:44.8809463Z     @given(
2025-05-07T20:32:44.8809585Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:44.8809693Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:44.8809810Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:44.8809929Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:44.8810050Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:44.8810128Z     )
2025-05-07T20:32:44.8810377Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:44.8810480Z     def test_silu_mul_quant(
2025-05-07T20:32:44.8810560Z         self,
2025-05-07T20:32:44.8810720Z         T: int,
2025-05-07T20:32:44.8810810Z         D: int,
2025-05-07T20:32:44.8810911Z         scale_ub: Optional[float],
2025-05-07T20:32:44.8811001Z         contiguous: bool,
2025-05-07T20:32:44.8811099Z         compiled: bool,
2025-05-07T20:32:44.8811220Z     ) -> None:
2025-05-07T20:32:44.8811326Z         torch.manual_seed(2025)
2025-05-07T20:32:44.8811401Z 
2025-05-07T20:32:44.8811572Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:44.8811653Z 
2025-05-07T20:32:44.8811750Z         x_sign = torch.sign(x)
2025-05-07T20:32:44.8811876Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:44.8811976Z         x = x_sign * x_clamp
2025-05-07T20:32:44.8812059Z         x0 = x[:, :D]
2025-05-07T20:32:44.8812140Z         x1 = x[:, D:]
2025-05-07T20:32:44.8812219Z 
2025-05-07T20:32:44.8812307Z         if contiguous:
2025-05-07T20:32:44.8812405Z             x0 = x0.contiguous()
2025-05-07T20:32:44.8812547Z             x1 = x1.contiguous()
2025-05-07T20:32:44.8812622Z 
2025-05-07T20:32:44.8812715Z         if scale_ub is not None:
2025-05-07T20:32:44.8812830Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:44.8813006Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:44.8813091Z             )
2025-05-07T20:32:44.8813168Z         else:
2025-05-07T20:32:44.8813262Z             scale_ub_tensor = None
2025-05-07T20:32:44.8813345Z 
2025-05-07T20:32:44.8813475Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:44.8813566Z             op = silu_mul_quant
2025-05-07T20:32:44.8813662Z             if compiled:
2025-05-07T20:32:44.8813763Z                 op = torch.compile(op)
2025-05-07T20:32:44.8813869Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.8813951Z 
2025-05-07T20:32:44.8814043Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.8814048Z 
2025-05-07T20:32:44.8814154Z moe/activation_test.py:117: 
2025-05-07T20:32:44.8814290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8814396Z moe/activation_test.py:115: in fn
2025-05-07T20:32:44.8814505Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.8815016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:44.8815115Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:44.8815488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:44.8815713Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.8816061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.8816157Z     kernel = self.compile(
2025-05-07T20:32:44.8816546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.8816732Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.8816864Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8816868Z 
2025-05-07T20:32:44.8817075Z self = 
2025-05-07T20:32:44.8817869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.8818377Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98065307c0>}
2025-05-07T20:32:44.8819138Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:44.8819377Z context = 
2025-05-07T20:32:44.8819382Z 
2025-05-07T20:32:44.8819594Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.8819857Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.8819964Z                            module_map=module_map)
2025-05-07T20:32:44.8820136Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8820236Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.8820313Z E   ^
2025-05-07T20:32:44.8820678Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8820682Z 
2025-05-07T20:32:44.8821103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8821147Z 
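Context for the failures that follow: the Triton kernel requests the fp8e4nv (FP8 E4M3) element type, which Triton's NVIDIA backend accepts only on GPUs with compute capability 8.9 or newer. This job runs on a linux.g5.4xlarge runner, whose NVIDIA A10G reports capability (8, 6), so only the 'fp8e4b15' and 'fp8e5' encodings are available and compilation fails before the kernel ever launches. Below is a minimal sketch of a capability guard such a test could use; the helper name supports_fp8e4nv and the skip placement are illustrative assumptions, not FBGEMM API.

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton's fp8e4nv (FP8 E4M3) codegen is assumed
        # to require compute capability 8.9+ (Ada/Hopper); the A10G on this
        # runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test class:
    # @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 not supported on this GPU")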
2025-05-07T20:32:44.8821262Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant; identical test body and traceback omitted
2025-05-07T20:32:44.8834910Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:44.8848035Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError; compiled=True runs additionally enter via torch/_dynamo/eval_frame.py:678 before the same kernel launch
2025-05-07T20:32:44.8861324Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:44.8874750Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) -> here fn() returns and the failure moves to the reference path (test body as above):
2025-05-07T20:32:44.8881058Z         y_fp8, y_scale = fn()
2025-05-07T20:32:44.8881182Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:44.8881264Z 
2025-05-07T20:32:44.8881399Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:44.8881503Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:44.8881613Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:44.8881735Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:44.8881875Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:44.8881957Z 
2025-05-07T20:32:44.8882058Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8882063Z 
2025-05-07T20:32:44.8882162Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8882297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8882404Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:44.8882548Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:44.8883116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:44.8883224Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:44.8883599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:44.8883825Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.8884192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:44.8884455Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:44.8884855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 
2025-05-07T20:32:44.8885116Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:44.8885493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:44.8885737Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:44.8886127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:44.8886205Z     fn()
2025-05-07T20:32:44.8886632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:44.8886724Z     self.fn.run(
2025-05-07T20:32:44.8887089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.8887188Z     kernel = self.compile(
2025-05-07T20:32:44.8887573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.8887746Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.8887922Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8887929Z 
2025-05-07T20:32:44.8888133Z self = 
2025-05-07T20:32:44.8888962Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.8889467Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98063a5440>}
2025-05-07T20:32:44.8890229Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:44.8890418Z context = 
2025-05-07T20:32:44.8890426Z 
2025-05-07T20:32:44.8890592Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.8890862Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.8890971Z                            module_map=module_map)
2025-05-07T20:32:44.8891137Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8891240Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8891317Z E   ^
2025-05-07T20:32:44.8891679Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8891684Z 
2025-05-07T20:32:44.8892100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8892104Z 
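The reference path above fails identically because triton_quantize_fp8_row JIT-compiles _kernel_quantize_fp8_row during autotuning, and that kernel also materializes the fp8e4nv type. To make the intended operation concrete, here is a hedged eager-PyTorch sketch of row-wise FP8 quantization consistent with the test's dequantization line y = y_fp8.to(torch.float32) * y_scale[:, None]; it is an illustration under stated assumptions, not FBGEMM's implementation, and the scale_ub handling (capping the per-row max) is assumed.

    import torch

    def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # Per-row symmetric quantization to FP8 E4M3: each row is scaled so its
        # max-abs value maps to the FP8 max, and the per-row dequant multiplier
        # is returned alongside the quantized tensor.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # assumed: cap row max at scale_ub
        y_scale = row_max / fp8_max                     # dequant multiplier per row
        y_fp8 = (y / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale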
2025-05-07T20:32:44.8892205Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant; identical test body and traceback omitted
2025-05-07T20:32:44.8905586Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:44.8918434Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:44.8937932Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:44.8951592Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:44.8964613Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> enters silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80; same CompilationError
2025-05-07T20:32:44.8971442Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8971802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8972038Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8972387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8972489Z kernel = self.compile( 2025-05-07T20:32:44.8972873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8973059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8973190Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8973194Z 2025-05-07T20:32:44.8973398Z self = 2025-05-07T20:32:44.8974186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8974806Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655b87a60>} 2025-05-07T20:32:44.8975565Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8975755Z context = 2025-05-07T20:32:44.8975759Z 2025-05-07T20:32:44.8975930Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8976192Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8976300Z module_map=module_map) 2025-05-07T20:32:44.8976468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8976611Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8976689Z E ^ 2025-05-07T20:32:44.8977095Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8977100Z 2025-05-07T20:32:44.8977571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8977576Z 2025-05-07T20:32:44.8977687Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8977912Z self=, 2025-05-07T20:32:44.8977991Z T=16384, 2025-05-07T20:32:44.8978074Z D=7168, 2025-05-07T20:32:44.8978157Z scale_ub=None, 2025-05-07T20:32:44.8978243Z contiguous=True, 2025-05-07T20:32:44.8978335Z compiled=True, 2025-05-07T20:32:44.8978407Z ) 2025-05-07T20:32:44.8978625Z self = 2025-05-07T20:32:44.8978808Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.8978813Z 2025-05-07T20:32:44.8978892Z @given( 2025-05-07T20:32:44.8979016Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8979119Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8979233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8979354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8979466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8979541Z ) 2025-05-07T20:32:44.8979794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8979888Z def test_silu_mul_quant( 2025-05-07T20:32:44.8979970Z self, 2025-05-07T20:32:44.8980046Z T: int, 2025-05-07T20:32:44.8980122Z D: int, 2025-05-07T20:32:44.8980227Z scale_ub: Optional[float], 2025-05-07T20:32:44.8980316Z contiguous: bool, 2025-05-07T20:32:44.8980406Z compiled: bool, 2025-05-07T20:32:44.8980489Z ) -> None: 2025-05-07T20:32:44.8980584Z torch.manual_seed(2025) 2025-05-07T20:32:44.8980656Z 2025-05-07T20:32:44.8980835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8980912Z 2025-05-07T20:32:44.8981007Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8981138Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8981226Z x = x_sign * x_clamp 2025-05-07T20:32:44.8981307Z x0 = x[:, :D] 2025-05-07T20:32:44.8981396Z x1 = x[:, D:] 2025-05-07T20:32:44.8981469Z 2025-05-07T20:32:44.8981562Z if contiguous: 2025-05-07T20:32:44.8981654Z x0 = x0.contiguous() 2025-05-07T20:32:44.8981744Z x1 = x1.contiguous() 2025-05-07T20:32:44.8981821Z 2025-05-07T20:32:44.8981913Z if scale_ub is not None: 2025-05-07T20:32:44.8982019Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8982163Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8982290Z ) 2025-05-07T20:32:44.8982369Z else: 2025-05-07T20:32:44.8982471Z scale_ub_tensor = None 2025-05-07T20:32:44.8982545Z 2025-05-07T20:32:44.8982713Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8982809Z op = silu_mul_quant 2025-05-07T20:32:44.8982894Z if compiled: 2025-05-07T20:32:44.8982998Z op = torch.compile(op) 2025-05-07T20:32:44.8983102Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8983174Z 2025-05-07T20:32:44.8983272Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8983277Z 2025-05-07T20:32:44.8983374Z moe/activation_test.py:117: 2025-05-07T20:32:44.8983508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8983616Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8983717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8984130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.8984228Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.8984760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8984865Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8985220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8985440Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8985785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8985877Z kernel = self.compile( 2025-05-07T20:32:44.8986263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8986440Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8986567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8986574Z 2025-05-07T20:32:44.8986789Z self = 2025-05-07T20:32:44.8987597Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8988131Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806ae1120>} 2025-05-07T20:32:44.8988885Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8989186Z context = 2025-05-07T20:32:44.8989191Z 2025-05-07T20:32:44.8989366Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8989630Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8989745Z module_map=module_map) 2025-05-07T20:32:44.8989905Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8990005Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8990089Z E ^ 2025-05-07T20:32:44.8990448Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8990452Z 2025-05-07T20:32:44.8990870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8990884Z 2025-05-07T20:32:44.8991039Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8991266Z self=, 2025-05-07T20:32:44.8991352Z T=4096, 2025-05-07T20:32:44.8991430Z D=5120, 2025-05-07T20:32:44.8991553Z scale_ub=None, 2025-05-07T20:32:44.8991647Z contiguous=False, 2025-05-07T20:32:44.8991731Z compiled=True, 2025-05-07T20:32:44.8991806Z ) 2025-05-07T20:32:44.8992030Z self = 2025-05-07T20:32:44.8992204Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.8992208Z 2025-05-07T20:32:44.8992294Z @given( 2025-05-07T20:32:44.8992412Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8992510Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8992631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8992748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8992905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8992986Z ) 2025-05-07T20:32:44.8993269Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8993372Z def test_silu_mul_quant( 2025-05-07T20:32:44.8993456Z self, 2025-05-07T20:32:44.8993531Z T: int, 2025-05-07T20:32:44.8993608Z D: int, 2025-05-07T20:32:44.8993712Z scale_ub: Optional[float], 2025-05-07T20:32:44.8993802Z contiguous: bool, 2025-05-07T20:32:44.8993893Z compiled: bool, 2025-05-07T20:32:44.8993970Z ) -> None: 2025-05-07T20:32:44.8994064Z torch.manual_seed(2025) 2025-05-07T20:32:44.8994143Z 2025-05-07T20:32:44.8994309Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8994382Z 2025-05-07T20:32:44.8994480Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8994603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8994699Z x = x_sign * x_clamp 2025-05-07T20:32:44.8994785Z x0 = x[:, :D] 2025-05-07T20:32:44.8994864Z x1 = x[:, D:] 2025-05-07T20:32:44.8994937Z 2025-05-07T20:32:44.8995027Z if contiguous: 2025-05-07T20:32:44.8995139Z x0 = x0.contiguous() 2025-05-07T20:32:44.8995237Z x1 = x1.contiguous() 2025-05-07T20:32:44.8995310Z 2025-05-07T20:32:44.8995401Z if scale_ub is not None: 2025-05-07T20:32:44.8995513Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8995648Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8995730Z ) 2025-05-07T20:32:44.8995806Z else: 2025-05-07T20:32:44.8995901Z scale_ub_tensor = None 2025-05-07T20:32:44.8995982Z 2025-05-07T20:32:44.8996110Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8996202Z op = silu_mul_quant 2025-05-07T20:32:44.8996293Z if compiled: 2025-05-07T20:32:44.8996399Z op = torch.compile(op) 2025-05-07T20:32:44.8996505Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8996587Z 2025-05-07T20:32:44.8996677Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8996684Z 2025-05-07T20:32:44.8996792Z moe/activation_test.py:117: 2025-05-07T20:32:44.8996919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8997019Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8997128Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8997495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.8997588Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.8998087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8998184Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8998548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8998818Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8999236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8999336Z kernel = self.compile( 2025-05-07T20:32:44.8999720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8999894Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9000028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9000033Z 2025-05-07T20:32:44.9000237Z self = 2025-05-07T20:32:44.9001026Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9001613Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806ae1c60>} 2025-05-07T20:32:44.9002376Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9002564Z context = 2025-05-07T20:32:44.9002568Z 2025-05-07T20:32:44.9002730Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9002995Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9003100Z module_map=module_map) 2025-05-07T20:32:44.9003270Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9003370Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9003446Z E ^ 2025-05-07T20:32:44.9003815Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9003819Z 2025-05-07T20:32:44.9004236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9004241Z 2025-05-07T20:32:44.9004343Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9004575Z self=, 2025-05-07T20:32:44.9004652Z T=4096, 2025-05-07T20:32:44.9004736Z D=5120, 2025-05-07T20:32:44.9004818Z scale_ub=1200.0, 2025-05-07T20:32:44.9004904Z contiguous=False, 2025-05-07T20:32:44.9004994Z compiled=False, 2025-05-07T20:32:44.9005066Z ) 2025-05-07T20:32:44.9005288Z self = 2025-05-07T20:32:44.9005472Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9005477Z 2025-05-07T20:32:44.9005556Z @given( 2025-05-07T20:32:44.9005674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9005781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9005893Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9006018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9006131Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9006203Z ) 2025-05-07T20:32:44.9006452Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9006546Z def test_silu_mul_quant( 2025-05-07T20:32:44.9006621Z self, 2025-05-07T20:32:44.9006702Z T: int, 2025-05-07T20:32:44.9006779Z D: int, 2025-05-07T20:32:44.9006881Z scale_ub: Optional[float], 2025-05-07T20:32:44.9007080Z contiguous: bool, 2025-05-07T20:32:44.9007172Z compiled: bool, 2025-05-07T20:32:44.9007268Z ) -> None: 2025-05-07T20:32:44.9007372Z torch.manual_seed(2025) 2025-05-07T20:32:44.9007445Z 2025-05-07T20:32:44.9007661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9007733Z 2025-05-07T20:32:44.9007825Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9007956Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9008043Z x = x_sign * x_clamp 2025-05-07T20:32:44.9008124Z x0 = x[:, :D] 2025-05-07T20:32:44.9008211Z x1 = x[:, D:] 2025-05-07T20:32:44.9008284Z 2025-05-07T20:32:44.9008366Z if contiguous: 2025-05-07T20:32:44.9008467Z x0 = x0.contiguous() 2025-05-07T20:32:44.9008561Z x1 = x1.contiguous() 2025-05-07T20:32:44.9008634Z 2025-05-07T20:32:44.9008730Z if scale_ub is not None: 2025-05-07T20:32:44.9008876Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9009022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9009097Z ) 2025-05-07T20:32:44.9009209Z else: 2025-05-07T20:32:44.9009312Z scale_ub_tensor = None 2025-05-07T20:32:44.9009383Z 2025-05-07T20:32:44.9009513Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9009610Z op = silu_mul_quant 2025-05-07T20:32:44.9009694Z if compiled: 2025-05-07T20:32:44.9009795Z op = torch.compile(op) 2025-05-07T20:32:44.9009907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9009979Z 2025-05-07T20:32:44.9010069Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9010073Z 2025-05-07T20:32:44.9010177Z moe/activation_test.py:117: 2025-05-07T20:32:44.9010305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9010413Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9010517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9011019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.9011125Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9011481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9011702Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9012048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9012143Z kernel = self.compile( 2025-05-07T20:32:44.9012532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9012704Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9012834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9012841Z 2025-05-07T20:32:44.9013052Z self = 2025-05-07T20:32:44.9013838Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9014349Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806ae3240>} 2025-05-07T20:32:44.9015101Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9015290Z context = 2025-05-07T20:32:44.9015350Z 2025-05-07T20:32:44.9015515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9015779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9015937Z module_map=module_map) 2025-05-07T20:32:44.9016098Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9016198Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9016284Z E ^ 2025-05-07T20:32:44.9016643Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9016648Z 2025-05-07T20:32:44.9017069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9017074Z 2025-05-07T20:32:44.9017177Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9017405Z self=, 2025-05-07T20:32:44.9017556Z T=4096, 2025-05-07T20:32:44.9017638Z D=5120, 2025-05-07T20:32:44.9017739Z scale_ub=1200.0, 2025-05-07T20:32:44.9017873Z contiguous=False, 2025-05-07T20:32:44.9017962Z compiled=True, 2025-05-07T20:32:44.9018034Z ) 2025-05-07T20:32:44.9018256Z self = 2025-05-07T20:32:44.9018429Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9018433Z 2025-05-07T20:32:44.9018515Z @given( 2025-05-07T20:32:44.9018632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9018730Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9018849Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9018966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9019078Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9019161Z ) 2025-05-07T20:32:44.9019411Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9019509Z def test_silu_mul_quant( 2025-05-07T20:32:44.9019581Z self, 2025-05-07T20:32:44.9019661Z T: int, 2025-05-07T20:32:44.9019746Z D: int, 2025-05-07T20:32:44.9019844Z scale_ub: Optional[float], 2025-05-07T20:32:44.9019932Z contiguous: bool, 2025-05-07T20:32:44.9020024Z compiled: bool, 2025-05-07T20:32:44.9020100Z ) -> None: 2025-05-07T20:32:44.9020195Z torch.manual_seed(2025) 2025-05-07T20:32:44.9020273Z 2025-05-07T20:32:44.9020441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9020514Z 2025-05-07T20:32:44.9020612Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9020736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9020826Z x = x_sign * x_clamp 2025-05-07T20:32:44.9020913Z x0 = x[:, :D] 2025-05-07T20:32:44.9020996Z x1 = x[:, D:] 2025-05-07T20:32:44.9021076Z 2025-05-07T20:32:44.9021159Z if contiguous: 2025-05-07T20:32:44.9021250Z x0 = x0.contiguous() 2025-05-07T20:32:44.9021344Z x1 = x1.contiguous() 2025-05-07T20:32:44.9021417Z 2025-05-07T20:32:44.9021507Z if scale_ub is not None: 2025-05-07T20:32:44.9021620Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9021752Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9021829Z ) 2025-05-07T20:32:44.9021911Z else: 2025-05-07T20:32:44.9022004Z scale_ub_tensor = None 2025-05-07T20:32:44.9022077Z 2025-05-07T20:32:44.9022209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9022298Z op = silu_mul_quant 2025-05-07T20:32:44.9022391Z if compiled: 2025-05-07T20:32:44.9022491Z op = torch.compile(op) 2025-05-07T20:32:44.9022596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9022751Z 2025-05-07T20:32:44.9022842Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9022847Z 2025-05-07T20:32:44.9022944Z moe/activation_test.py:117: 2025-05-07T20:32:44.9023121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9023225Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9023326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9023700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9023792Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9024294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9024392Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9024748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9025016Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9025395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9025500Z kernel = self.compile( 2025-05-07T20:32:44.9025881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9026051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9026183Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9026188Z 2025-05-07T20:32:44.9026390Z self = 2025-05-07T20:32:44.9027170Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9027684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664c720>} 2025-05-07T20:32:44.9028828Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9029079Z context = 2025-05-07T20:32:44.9029085Z 2025-05-07T20:32:44.9029248Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9029516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9029623Z module_map=module_map) 2025-05-07T20:32:44.9029784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9029893Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9029972Z E ^ 2025-05-07T20:32:44.9030328Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9030335Z 2025-05-07T20:32:44.9030760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9030764Z 2025-05-07T20:32:44.9030866Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9031094Z self=, 2025-05-07T20:32:44.9031175Z T=2048, 2025-05-07T20:32:44.9031251Z D=7168, 2025-05-07T20:32:44.9031342Z scale_ub=1200.0, 2025-05-07T20:32:44.9031429Z contiguous=False, 2025-05-07T20:32:44.9031513Z compiled=False, 2025-05-07T20:32:44.9031591Z ) 2025-05-07T20:32:44.9031807Z self = 2025-05-07T20:32:44.9031988Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9032151Z 2025-05-07T20:32:44.9032232Z @given( 2025-05-07T20:32:44.9032349Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9032530Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9032646Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9032763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9032880Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9032954Z ) 2025-05-07T20:32:44.9033198Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9033296Z def test_silu_mul_quant( 2025-05-07T20:32:44.9033371Z self, 2025-05-07T20:32:44.9033453Z T: int, 2025-05-07T20:32:44.9033529Z D: int, 2025-05-07T20:32:44.9033627Z scale_ub: Optional[float], 2025-05-07T20:32:44.9033720Z contiguous: bool, 2025-05-07T20:32:44.9033805Z compiled: bool, 2025-05-07T20:32:44.9033952Z ) -> None: 2025-05-07T20:32:44.9034054Z torch.manual_seed(2025) 2025-05-07T20:32:44.9034125Z 2025-05-07T20:32:44.9034387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9034469Z 2025-05-07T20:32:44.9034560Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9034683Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9034778Z x = x_sign * x_clamp 2025-05-07T20:32:44.9034857Z x0 = x[:, :D] 2025-05-07T20:32:44.9034938Z x1 = x[:, D:] 2025-05-07T20:32:44.9035015Z 2025-05-07T20:32:44.9035098Z if contiguous: 2025-05-07T20:32:44.9035193Z x0 = x0.contiguous() 2025-05-07T20:32:44.9035280Z x1 = x1.contiguous() 2025-05-07T20:32:44.9035351Z 2025-05-07T20:32:44.9035448Z if scale_ub is not None: 2025-05-07T20:32:44.9035553Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9035687Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9035776Z ) 2025-05-07T20:32:44.9035851Z else: 2025-05-07T20:32:44.9035943Z scale_ub_tensor = None 2025-05-07T20:32:44.9036022Z 2025-05-07T20:32:44.9036154Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9036244Z op = silu_mul_quant 2025-05-07T20:32:44.9036338Z if compiled: 2025-05-07T20:32:44.9036437Z op = torch.compile(op) 2025-05-07T20:32:44.9036552Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9036626Z 2025-05-07T20:32:44.9036715Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9036719Z 2025-05-07T20:32:44.9036827Z moe/activation_test.py:117: 2025-05-07T20:32:44.9036956Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9037079Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9037196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9037710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.9037814Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9038175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9038396Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9038741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9038833Z kernel = self.compile( 2025-05-07T20:32:44.9039213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9039391Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9039520Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9039526Z 2025-05-07T20:32:44.9039736Z self = 2025-05-07T20:32:44.9040612Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9041117Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664d580>} 2025-05-07T20:32:44.9041877Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9042067Z context = 2025-05-07T20:32:44.9042072Z 2025-05-07T20:32:44.9042242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9042545Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9042652Z module_map=module_map) 2025-05-07T20:32:44.9042858Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9042961Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9043048Z E ^ 2025-05-07T20:32:44.9043404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9043409Z 2025-05-07T20:32:44.9043825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9043829Z 2025-05-07T20:32:44.9043940Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9044163Z self=, 2025-05-07T20:32:44.9044247Z T=1, 2025-05-07T20:32:44.9044325Z D=7168, 2025-05-07T20:32:44.9044410Z scale_ub=None, 2025-05-07T20:32:44.9044503Z contiguous=True, 2025-05-07T20:32:44.9044588Z compiled=False, 2025-05-07T20:32:44.9044660Z ) 2025-05-07T20:32:44.9044889Z self = 2025-05-07T20:32:44.9045053Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9045058Z 2025-05-07T20:32:44.9045134Z @given( 2025-05-07T20:32:44.9045260Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9045360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9045484Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9045602Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9045715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9045794Z ) 2025-05-07T20:32:44.9046040Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9046139Z def test_silu_mul_quant( 2025-05-07T20:32:44.9046226Z self, 2025-05-07T20:32:44.9046306Z T: int, 2025-05-07T20:32:44.9046382Z D: int, 2025-05-07T20:32:44.9046490Z scale_ub: Optional[float], 2025-05-07T20:32:44.9046582Z contiguous: bool, 2025-05-07T20:32:44.9046669Z compiled: bool, 2025-05-07T20:32:44.9046760Z ) -> None: 2025-05-07T20:32:44.9046855Z torch.manual_seed(2025) 2025-05-07T20:32:44.9046949Z 2025-05-07T20:32:44.9047142Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9047225Z 2025-05-07T20:32:44.9047323Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9047446Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9047534Z x = x_sign * x_clamp 2025-05-07T20:32:44.9047624Z x0 = x[:, :D] 2025-05-07T20:32:44.9047705Z x1 = x[:, D:] 2025-05-07T20:32:44.9047778Z 2025-05-07T20:32:44.9047866Z if contiguous: 2025-05-07T20:32:44.9047958Z x0 = x0.contiguous() 2025-05-07T20:32:44.9048097Z x1 = x1.contiguous() 2025-05-07T20:32:44.9048175Z 2025-05-07T20:32:44.9048265Z if scale_ub is not None: 2025-05-07T20:32:44.9048409Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9048550Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9048625Z ) 2025-05-07T20:32:44.9048709Z else: 2025-05-07T20:32:44.9048802Z scale_ub_tensor = None 2025-05-07T20:32:44.9048873Z 2025-05-07T20:32:44.9049006Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9049095Z op = silu_mul_quant 2025-05-07T20:32:44.9049179Z if compiled: 2025-05-07T20:32:44.9049285Z op = torch.compile(op) 2025-05-07T20:32:44.9049391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9049462Z 2025-05-07T20:32:44.9049559Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9049607Z 2025-05-07T20:32:44.9049704Z moe/activation_test.py:117: 2025-05-07T20:32:44.9049840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9049942Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9050106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9050613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9050710Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9051068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9051296Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9051634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9051732Z kernel = self.compile( 2025-05-07T20:32:44.9052111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9052287Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9052424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9052428Z 2025-05-07T20:32:44.9052634Z self = 2025-05-07T20:32:44.9053419Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9053919Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664cea0>} 2025-05-07T20:32:44.9054671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9054870Z context = 2025-05-07T20:32:44.9054874Z 2025-05-07T20:32:44.9055040Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9055308Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9055414Z module_map=module_map) 2025-05-07T20:32:44.9055574Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9055679Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9055755Z E ^ 2025-05-07T20:32:44.9056111Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9056121Z 2025-05-07T20:32:44.9056542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9056592Z 2025-05-07T20:32:44.9056696Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9056933Z self=, 2025-05-07T20:32:44.9057052Z T=16384, 2025-05-07T20:32:44.9057130Z D=7168, 2025-05-07T20:32:44.9057218Z scale_ub=1200.0, 2025-05-07T20:32:44.9061351Z contiguous=False, 2025-05-07T20:32:44.9061462Z compiled=True, 2025-05-07T20:32:44.9061539Z ) 2025-05-07T20:32:44.9061772Z self = 2025-05-07T20:32:44.9061953Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9061957Z 2025-05-07T20:32:44.9062039Z @given( 2025-05-07T20:32:44.9062167Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9062267Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9062390Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9062587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9062702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9062787Z ) 2025-05-07T20:32:44.9063185Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9063286Z def test_silu_mul_quant( 2025-05-07T20:32:44.9063372Z self, 2025-05-07T20:32:44.9063450Z T: int, 2025-05-07T20:32:44.9063527Z D: int, 2025-05-07T20:32:44.9063637Z scale_ub: Optional[float], 2025-05-07T20:32:44.9063727Z contiguous: bool, 2025-05-07T20:32:44.9063814Z compiled: bool, 2025-05-07T20:32:44.9063903Z ) -> None: 2025-05-07T20:32:44.9063999Z torch.manual_seed(2025) 2025-05-07T20:32:44.9064082Z 2025-05-07T20:32:44.9064250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9064327Z 2025-05-07T20:32:44.9064428Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9064560Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9064651Z x = x_sign * x_clamp 2025-05-07T20:32:44.9064740Z x0 = x[:, :D] 2025-05-07T20:32:44.9064826Z x1 = x[:, D:] 2025-05-07T20:32:44.9064901Z 2025-05-07T20:32:44.9064995Z if contiguous: 2025-05-07T20:32:44.9065090Z x0 = x0.contiguous() 2025-05-07T20:32:44.9065180Z x1 = x1.contiguous() 2025-05-07T20:32:44.9065264Z 2025-05-07T20:32:44.9065355Z if scale_ub is not None: 2025-05-07T20:32:44.9065463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9065607Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9065689Z ) 2025-05-07T20:32:44.9065774Z else: 2025-05-07T20:32:44.9065870Z scale_ub_tensor = None 2025-05-07T20:32:44.9065945Z 2025-05-07T20:32:44.9066083Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9066178Z op = silu_mul_quant 2025-05-07T20:32:44.9066267Z if compiled: 2025-05-07T20:32:44.9066379Z op = torch.compile(op) 2025-05-07T20:32:44.9066485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9066565Z 2025-05-07T20:32:44.9066685Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9066690Z 2025-05-07T20:32:44.9066801Z moe/activation_test.py:117: 2025-05-07T20:32:44.9066953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9067055Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9067156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9067539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9067634Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9068129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9068234Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9068644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9068915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9069374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9069469Z kernel = self.compile( 2025-05-07T20:32:44.9069859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9070034Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9070163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9070174Z 2025-05-07T20:32:44.9070380Z self = 2025-05-07T20:32:44.9071277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9071796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664f9c0>} 2025-05-07T20:32:44.9072550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9072749Z context = 2025-05-07T20:32:44.9072753Z 2025-05-07T20:32:44.9072919Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9073181Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9073303Z module_map=module_map) 2025-05-07T20:32:44.9073464Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9073570Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9073651Z E ^ 2025-05-07T20:32:44.9074010Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9074015Z 2025-05-07T20:32:44.9074436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9074440Z 2025-05-07T20:32:44.9074544Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9074766Z self=, 2025-05-07T20:32:44.9074850Z T=1, 2025-05-07T20:32:44.9074929Z D=7168, 2025-05-07T20:32:44.9075022Z scale_ub=None, 2025-05-07T20:32:44.9075110Z contiguous=False, 2025-05-07T20:32:44.9075197Z compiled=False, 2025-05-07T20:32:44.9075282Z ) 2025-05-07T20:32:44.9075499Z self = 2025-05-07T20:32:44.9075668Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9075674Z 2025-05-07T20:32:44.9075759Z @given( 2025-05-07T20:32:44.9075877Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9075975Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9076098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9076213Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9076333Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9076409Z ) 2025-05-07T20:32:44.9076654Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9076757Z def test_silu_mul_quant( 2025-05-07T20:32:44.9076838Z self, 2025-05-07T20:32:44.9076916Z T: int, 2025-05-07T20:32:44.9077007Z D: int, 2025-05-07T20:32:44.9077159Z scale_ub: Optional[float], 2025-05-07T20:32:44.9077251Z contiguous: bool, 2025-05-07T20:32:44.9077367Z compiled: bool, 2025-05-07T20:32:44.9077454Z ) -> None: 2025-05-07T20:32:44.9077612Z torch.manual_seed(2025) 2025-05-07T20:32:44.9077698Z 2025-05-07T20:32:44.9077870Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9077953Z 2025-05-07T20:32:44.9078050Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9078177Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9078278Z x = x_sign * x_clamp 2025-05-07T20:32:44.9078360Z x0 = x[:, :D] 2025-05-07T20:32:44.9078444Z x1 = x[:, D:] 2025-05-07T20:32:44.9078525Z 2025-05-07T20:32:44.9078611Z if contiguous: 2025-05-07T20:32:44.9078705Z x0 = x0.contiguous() 2025-05-07T20:32:44.9078804Z x1 = x1.contiguous() 2025-05-07T20:32:44.9078923Z 2025-05-07T20:32:44.9079018Z if scale_ub is not None: 2025-05-07T20:32:44.9079133Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9079307Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9079398Z ) 2025-05-07T20:32:44.9079476Z else: 2025-05-07T20:32:44.9079572Z scale_ub_tensor = None 2025-05-07T20:32:44.9079652Z 2025-05-07T20:32:44.9079783Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9079873Z op = silu_mul_quant 2025-05-07T20:32:44.9079968Z if compiled: 2025-05-07T20:32:44.9080072Z op = torch.compile(op) 2025-05-07T20:32:44.9080178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9080259Z 2025-05-07T20:32:44.9080353Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9080357Z 2025-05-07T20:32:44.9080457Z moe/activation_test.py:117: 2025-05-07T20:32:44.9080597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9080707Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9080815Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9081322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9081420Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9081788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9082012Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9082354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9082462Z kernel = self.compile( 2025-05-07T20:32:44.9082846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9083029Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9083161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9083166Z 2025-05-07T20:32:44.9083373Z self = 2025-05-07T20:32:44.9084167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9084671Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655c5c860>} 2025-05-07T20:32:44.9085434Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9085673Z context = 2025-05-07T20:32:44.9085678Z 2025-05-07T20:32:44.9085852Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9086172Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9086281Z module_map=module_map) 2025-05-07T20:32:44.9086452Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9086554Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9086632Z E ^ 2025-05-07T20:32:44.9087048Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9087053Z 2025-05-07T20:32:44.9087469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9087473Z 2025-05-07T20:32:44.9087624Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9087850Z self=, 2025-05-07T20:32:44.9087929Z T=2048, 2025-05-07T20:32:44.9088015Z D=7168, 2025-05-07T20:32:44.9088136Z scale_ub=None, 2025-05-07T20:32:44.9088228Z contiguous=False, 2025-05-07T20:32:44.9088322Z compiled=True, 2025-05-07T20:32:44.9088395Z ) 2025-05-07T20:32:44.9088615Z self = 2025-05-07T20:32:44.9088794Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9088799Z 2025-05-07T20:32:44.9088877Z @given( 2025-05-07T20:32:44.9089000Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9089100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9089218Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9089344Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9089460Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9089539Z ) 2025-05-07T20:32:44.9089794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9089891Z def test_silu_mul_quant( 2025-05-07T20:32:44.9089978Z self, 2025-05-07T20:32:44.9090055Z T: int, 2025-05-07T20:32:44.9090133Z D: int, 2025-05-07T20:32:44.9090238Z scale_ub: Optional[float], 2025-05-07T20:32:44.9090330Z contiguous: bool, 2025-05-07T20:32:44.9090416Z compiled: bool, 2025-05-07T20:32:44.9090500Z ) -> None: 2025-05-07T20:32:44.9090597Z torch.manual_seed(2025) 2025-05-07T20:32:44.9090670Z 2025-05-07T20:32:44.9090844Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9090920Z 2025-05-07T20:32:44.9091016Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9091147Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9091238Z x = x_sign * x_clamp 2025-05-07T20:32:44.9091324Z x0 = x[:, :D] 2025-05-07T20:32:44.9091413Z x1 = x[:, D:] 2025-05-07T20:32:44.9091488Z 2025-05-07T20:32:44.9091581Z if contiguous: 2025-05-07T20:32:44.9091676Z x0 = x0.contiguous() 2025-05-07T20:32:44.9091767Z x1 = x1.contiguous() 2025-05-07T20:32:44.9091849Z 2025-05-07T20:32:44.9091941Z if scale_ub is not None: 2025-05-07T20:32:44.9092049Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9092191Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9092269Z ) 2025-05-07T20:32:44.9092346Z else: 2025-05-07T20:32:44.9092449Z scale_ub_tensor = None 2025-05-07T20:32:44.9092523Z 2025-05-07T20:32:44.9092657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9092758Z op = silu_mul_quant 2025-05-07T20:32:44.9092844Z if compiled: 2025-05-07T20:32:44.9092953Z op = torch.compile(op) 2025-05-07T20:32:44.9093114Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9093188Z 2025-05-07T20:32:44.9093290Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9093295Z 2025-05-07T20:32:44.9093397Z moe/activation_test.py:117: 2025-05-07T20:32:44.9093568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9093681Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9093782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9094152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9094255Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9094750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9094857Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9095215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9095480Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9095872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9095969Z kernel = self.compile( 2025-05-07T20:32:44.9096360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9096535Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9096667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9096671Z 2025-05-07T20:32:44.9096883Z self = 2025-05-07T20:32:44.9097720Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9098241Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655c5dbc0>} 2025-05-07T20:32:44.9098999Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9099189Z context = 2025-05-07T20:32:44.9099193Z 2025-05-07T20:32:44.9099363Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9099628Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9099748Z module_map=module_map) 2025-05-07T20:32:44.9099911Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9100019Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9100106Z E ^ 2025-05-07T20:32:44.9100470Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9100475Z 2025-05-07T20:32:44.9100907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9100911Z 2025-05-07T20:32:44.9101015Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9101248Z self=, 2025-05-07T20:32:44.9101325Z T=4096, 2025-05-07T20:32:44.9101402Z D=7168, 2025-05-07T20:32:44.9101489Z scale_ub=None, 2025-05-07T20:32:44.9101577Z contiguous=False, 2025-05-07T20:32:44.9101661Z compiled=True, 2025-05-07T20:32:44.9101739Z ) 2025-05-07T20:32:44.9101959Z self = 2025-05-07T20:32:44.9102179Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9102183Z 2025-05-07T20:32:44.9102267Z @given( 2025-05-07T20:32:44.9102458Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9102558Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9102678Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9102796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9102915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9102990Z ) 2025-05-07T20:32:44.9103235Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9103337Z def test_silu_mul_quant( 2025-05-07T20:32:44.9103414Z self, 2025-05-07T20:32:44.9103492Z T: int, 2025-05-07T20:32:44.9103576Z D: int, 2025-05-07T20:32:44.9103674Z scale_ub: Optional[float], 2025-05-07T20:32:44.9103809Z contiguous: bool, 2025-05-07T20:32:44.9103903Z compiled: bool, 2025-05-07T20:32:44.9103985Z ) -> None: 2025-05-07T20:32:44.9104086Z torch.manual_seed(2025) 2025-05-07T20:32:44.9104159Z 2025-05-07T20:32:44.9104370Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9104450Z 2025-05-07T20:32:44.9104542Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9104666Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9104760Z x = x_sign * x_clamp 2025-05-07T20:32:44.9104841Z x0 = x[:, :D] 2025-05-07T20:32:44.9104924Z x1 = x[:, D:] 2025-05-07T20:32:44.9105002Z 2025-05-07T20:32:44.9105086Z if contiguous: 2025-05-07T20:32:44.9105177Z x0 = x0.contiguous() 2025-05-07T20:32:44.9105272Z x1 = x1.contiguous() 2025-05-07T20:32:44.9105348Z 2025-05-07T20:32:44.9105441Z if scale_ub is not None: 2025-05-07T20:32:44.9105552Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9105691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9105773Z ) 2025-05-07T20:32:44.9105850Z else: 2025-05-07T20:32:44.9105946Z scale_ub_tensor = None 2025-05-07T20:32:44.9106028Z 2025-05-07T20:32:44.9106161Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9106251Z op = silu_mul_quant 2025-05-07T20:32:44.9106344Z if compiled: 2025-05-07T20:32:44.9106444Z op = torch.compile(op) 2025-05-07T20:32:44.9106548Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9106632Z 2025-05-07T20:32:44.9106722Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9106727Z 2025-05-07T20:32:44.9106831Z moe/activation_test.py:117: 2025-05-07T20:32:44.9106961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9107063Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9107171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9107579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9107689Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9108192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9108288Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9108648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9108870Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9109329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9109429Z kernel = self.compile( 2025-05-07T20:32:44.9109810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9110038Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9110174Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9110178Z 2025-05-07T20:32:44.9110420Z self = 2025-05-07T20:32:44.9111214Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9111716Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655c5e700>} 2025-05-07T20:32:44.9112474Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9112706Z context = 2025-05-07T20:32:44.9112711Z 2025-05-07T20:32:44.9112913Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9113183Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9113289Z module_map=module_map) 2025-05-07T20:32:44.9113451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9113556Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9113633Z E ^ 2025-05-07T20:32:44.9113998Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9114002Z 2025-05-07T20:32:44.9114419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9114427Z 2025-05-07T20:32:44.9114532Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9114762Z self=, 2025-05-07T20:32:44.9114842Z T=16384, 2025-05-07T20:32:44.9114927Z D=5120, 2025-05-07T20:32:44.9115012Z scale_ub=1200.0, 2025-05-07T20:32:44.9115100Z contiguous=False, 2025-05-07T20:32:44.9115193Z compiled=False, 2025-05-07T20:32:44.9115268Z ) 2025-05-07T20:32:44.9115486Z self = 2025-05-07T20:32:44.9115673Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9115677Z 2025-05-07T20:32:44.9115759Z @given( 2025-05-07T20:32:44.9115877Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9115982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9116098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9116223Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9116339Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9116414Z ) 2025-05-07T20:32:44.9116669Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9116789Z def test_silu_mul_quant( 2025-05-07T20:32:44.9116870Z self, 2025-05-07T20:32:44.9116974Z T: int, 2025-05-07T20:32:44.9117051Z D: int, 2025-05-07T20:32:44.9117152Z scale_ub: Optional[float], 2025-05-07T20:32:44.9117248Z contiguous: bool, 2025-05-07T20:32:44.9117336Z compiled: bool, 2025-05-07T20:32:44.9117415Z ) -> None: 2025-05-07T20:32:44.9117518Z torch.manual_seed(2025) 2025-05-07T20:32:44.9117590Z 2025-05-07T20:32:44.9117759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9117840Z 2025-05-07T20:32:44.9117933Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9118065Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9118211Z x = x_sign * x_clamp 2025-05-07T20:32:44.9118291Z x0 = x[:, :D] 2025-05-07T20:32:44.9118378Z x1 = x[:, D:] 2025-05-07T20:32:44.9118451Z 2025-05-07T20:32:44.9118538Z if contiguous: 2025-05-07T20:32:44.9118675Z x0 = x0.contiguous() 2025-05-07T20:32:44.9118765Z x1 = x1.contiguous() 2025-05-07T20:32:44.9118836Z 2025-05-07T20:32:44.9118932Z if scale_ub is not None: 2025-05-07T20:32:44.9119037Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9119173Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9119254Z ) 2025-05-07T20:32:44.9119330Z else: 2025-05-07T20:32:44.9119430Z scale_ub_tensor = None 2025-05-07T20:32:44.9119501Z 2025-05-07T20:32:44.9119629Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9119727Z op = silu_mul_quant 2025-05-07T20:32:44.9119853Z if compiled: 2025-05-07T20:32:44.9119955Z op = torch.compile(op) 2025-05-07T20:32:44.9120066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9120149Z 2025-05-07T20:32:44.9120287Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9120295Z 2025-05-07T20:32:44.9120396Z moe/activation_test.py:117: 2025-05-07T20:32:44.9120526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9120635Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9120737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9121243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.9121339Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9121698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9121926Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9122271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9122367Z kernel = self.compile( 2025-05-07T20:32:44.9122758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9122930Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9123063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9123068Z 2025-05-07T20:32:44.9123270Z self = 2025-05-07T20:32:44.9124052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9124568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655c5f060>} 2025-05-07T20:32:44.9125328Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9125524Z context = 2025-05-07T20:32:44.9125529Z 2025-05-07T20:32:44.9125694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9125963Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9126070Z module_map=module_map) 2025-05-07T20:32:44.9126231Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9126337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9126418Z E ^ 2025-05-07T20:32:44.9126823Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9126827Z 2025-05-07T20:32:44.9127311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9127317Z 2025-05-07T20:32:44.9127434Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9127683Z self=, 2025-05-07T20:32:44.9127763Z T=16384, 2025-05-07T20:32:44.9127840Z D=5120, 2025-05-07T20:32:44.9127931Z scale_ub=1200.0, 2025-05-07T20:32:44.9128017Z contiguous=True, 2025-05-07T20:32:44.9128100Z compiled=True, 2025-05-07T20:32:44.9128467Z ) 2025-05-07T20:32:44.9128782Z self = 2025-05-07T20:32:44.9128977Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9129169Z 2025-05-07T20:32:44.9129247Z @given( 2025-05-07T20:32:44.9129371Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9129479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9129672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9129800Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9129926Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9130000Z ) 2025-05-07T20:32:44.9130286Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9130388Z def test_silu_mul_quant( 2025-05-07T20:32:44.9130465Z self, 2025-05-07T20:32:44.9130542Z T: int, 2025-05-07T20:32:44.9130626Z D: int, 2025-05-07T20:32:44.9130728Z scale_ub: Optional[float], 2025-05-07T20:32:44.9130825Z contiguous: bool, 2025-05-07T20:32:44.9130914Z compiled: bool, 2025-05-07T20:32:44.9130994Z ) -> None: 2025-05-07T20:32:44.9131100Z torch.manual_seed(2025) 2025-05-07T20:32:44.9131176Z 2025-05-07T20:32:44.9131360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9131440Z 2025-05-07T20:32:44.9131539Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9131670Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9131771Z x = x_sign * x_clamp 2025-05-07T20:32:44.9131854Z x0 = x[:, :D] 2025-05-07T20:32:44.9131934Z x1 = x[:, D:] 2025-05-07T20:32:44.9132013Z 2025-05-07T20:32:44.9132098Z if contiguous: 2025-05-07T20:32:44.9132191Z x0 = x0.contiguous() 2025-05-07T20:32:44.9132291Z x1 = x1.contiguous() 2025-05-07T20:32:44.9132364Z 2025-05-07T20:32:44.9132463Z if scale_ub is not None: 2025-05-07T20:32:44.9132573Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9132716Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9132802Z ) 2025-05-07T20:32:44.9132881Z else: 2025-05-07T20:32:44.9132977Z scale_ub_tensor = None 2025-05-07T20:32:44.9133055Z 2025-05-07T20:32:44.9133194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9133288Z op = silu_mul_quant 2025-05-07T20:32:44.9133382Z if compiled: 2025-05-07T20:32:44.9133486Z op = torch.compile(op) 2025-05-07T20:32:44.9133599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9133681Z 2025-05-07T20:32:44.9133775Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9133780Z 2025-05-07T20:32:44.9133886Z moe/activation_test.py:117: 2025-05-07T20:32:44.9134028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9134134Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9134245Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9134687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9134902Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9135475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9135573Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9135937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9136158Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9136497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9136601Z kernel = self.compile( 2025-05-07T20:32:44.9136983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9137161Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9137339Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9137345Z 2025-05-07T20:32:44.9137614Z self = 2025-05-07T20:32:44.9138430Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9138932Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96559e51c0>} 2025-05-07T20:32:44.9139691Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9139882Z context = 2025-05-07T20:32:44.9139891Z 2025-05-07T20:32:44.9140056Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9140328Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9140435Z module_map=module_map) 2025-05-07T20:32:44.9140602Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9140702Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9140779Z E ^ 2025-05-07T20:32:44.9141145Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9141150Z 2025-05-07T20:32:44.9141568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9141572Z 2025-05-07T20:32:44.9141683Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9141911Z self=, 2025-05-07T20:32:44.9141990Z T=16384, 2025-05-07T20:32:44.9142073Z D=5120, 2025-05-07T20:32:44.9142157Z scale_ub=None, 2025-05-07T20:32:44.9142248Z contiguous=False, 2025-05-07T20:32:44.9142343Z compiled=True, 2025-05-07T20:32:44.9142416Z ) 2025-05-07T20:32:44.9142635Z self = 2025-05-07T20:32:44.9142819Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9142824Z 2025-05-07T20:32:44.9142902Z @given( 2025-05-07T20:32:44.9143028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9143128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9143244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9143369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9143483Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9143560Z ) 2025-05-07T20:32:44.9143863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9143958Z def test_silu_mul_quant( 2025-05-07T20:32:44.9144036Z self, 2025-05-07T20:32:44.9144156Z T: int, 2025-05-07T20:32:44.9144234Z D: int, 2025-05-07T20:32:44.9144332Z scale_ub: Optional[float], 2025-05-07T20:32:44.9144425Z contiguous: bool, 2025-05-07T20:32:44.9144511Z compiled: bool, 2025-05-07T20:32:44.9144594Z ) -> None: 2025-05-07T20:32:44.9144688Z torch.manual_seed(2025) 2025-05-07T20:32:44.9144760Z 2025-05-07T20:32:44.9144935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9145009Z 2025-05-07T20:32:44.9145101Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9145230Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9145319Z x = x_sign * x_clamp 2025-05-07T20:32:44.9145398Z x0 = x[:, :D] 2025-05-07T20:32:44.9145533Z x1 = x[:, D:] 2025-05-07T20:32:44.9145605Z 2025-05-07T20:32:44.9145689Z if contiguous: 2025-05-07T20:32:44.9145786Z x0 = x0.contiguous() 2025-05-07T20:32:44.9145913Z x1 = x1.contiguous() 2025-05-07T20:32:44.9145990Z 2025-05-07T20:32:44.9146086Z if scale_ub is not None: 2025-05-07T20:32:44.9146193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9146332Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9146408Z ) 2025-05-07T20:32:44.9146483Z else: 2025-05-07T20:32:44.9146581Z scale_ub_tensor = None 2025-05-07T20:32:44.9146654Z 2025-05-07T20:32:44.9146783Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9146880Z op = silu_mul_quant 2025-05-07T20:32:44.9146966Z if compiled: 2025-05-07T20:32:44.9147066Z op = torch.compile(op) 2025-05-07T20:32:44.9147177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9147256Z 2025-05-07T20:32:44.9147349Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9147358Z 2025-05-07T20:32:44.9147455Z moe/activation_test.py:117: 2025-05-07T20:32:44.9147588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9147695Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9147794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9148161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9148259Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9148752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9148853Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9149289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9149517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9149864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9149959Z kernel = self.compile( 2025-05-07T20:32:44.9150340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9150516Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9150644Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9150649Z 2025-05-07T20:32:44.9150857Z self = 2025-05-07T20:32:44.9151639Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9152198Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96559e5d00>} 2025-05-07T20:32:44.9152996Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9153186Z context = 2025-05-07T20:32:44.9153191Z 2025-05-07T20:32:44.9153359Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9153620Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9153727Z module_map=module_map) 2025-05-07T20:32:44.9153895Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9154038Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9154124Z E ^ 2025-05-07T20:32:44.9154520Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9154525Z 2025-05-07T20:32:44.9154946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9154951Z 2025-05-07T20:32:44.9155061Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9155285Z self=, 2025-05-07T20:32:44.9155369Z T=2048, 2025-05-07T20:32:44.9155445Z D=5120, 2025-05-07T20:32:44.9155527Z scale_ub=None, 2025-05-07T20:32:44.9155622Z contiguous=False, 2025-05-07T20:32:44.9155706Z compiled=True, 2025-05-07T20:32:44.9155778Z ) 2025-05-07T20:32:44.9156006Z self = 2025-05-07T20:32:44.9156182Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9156190Z 2025-05-07T20:32:44.9156268Z @given( 2025-05-07T20:32:44.9156392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9156497Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9156620Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9156735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9156851Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9156932Z ) 2025-05-07T20:32:44.9157215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9157323Z def test_silu_mul_quant( 2025-05-07T20:32:44.9157407Z self, 2025-05-07T20:32:44.9157483Z T: int, 2025-05-07T20:32:44.9157560Z D: int, 2025-05-07T20:32:44.9157665Z scale_ub: Optional[float], 2025-05-07T20:32:44.9157756Z contiguous: bool, 2025-05-07T20:32:44.9157846Z compiled: bool, 2025-05-07T20:32:44.9157936Z ) -> None: 2025-05-07T20:32:44.9158030Z torch.manual_seed(2025) 2025-05-07T20:32:44.9158110Z 2025-05-07T20:32:44.9158279Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9158355Z 2025-05-07T20:32:44.9158453Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9158579Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9158667Z x = x_sign * x_clamp 2025-05-07T20:32:44.9158755Z x0 = x[:, :D] 2025-05-07T20:32:44.9158837Z x1 = x[:, D:] 2025-05-07T20:32:44.9158910Z 2025-05-07T20:32:44.9159000Z if contiguous: 2025-05-07T20:32:44.9159092Z x0 = x0.contiguous() 2025-05-07T20:32:44.9159182Z x1 = x1.contiguous() 2025-05-07T20:32:44.9159259Z 2025-05-07T20:32:44.9159350Z if scale_ub is not None: 2025-05-07T20:32:44.9159457Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9159596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9159746Z ) 2025-05-07T20:32:44.9159827Z else: 2025-05-07T20:32:44.9159921Z scale_ub_tensor = None 2025-05-07T20:32:44.9159993Z 2025-05-07T20:32:44.9160170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9160262Z op = silu_mul_quant 2025-05-07T20:32:44.9160346Z if compiled: 2025-05-07T20:32:44.9160453Z op = torch.compile(op) 2025-05-07T20:32:44.9160560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9160633Z 2025-05-07T20:32:44.9160729Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9160734Z 2025-05-07T20:32:44.9160831Z moe/activation_test.py:117: 2025-05-07T20:32:44.9160966Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9161067Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9161167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9161586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9161680Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9162242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9162349Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9162704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9162933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9163270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9163363Z kernel = self.compile( 2025-05-07T20:32:44.9163749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9163923Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9164051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9164061Z 2025-05-07T20:32:44.9164270Z self = 2025-05-07T20:32:44.9165050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9165557Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96559e5620>} 2025-05-07T20:32:44.9166309Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9166509Z context = 2025-05-07T20:32:44.9166513Z 2025-05-07T20:32:44.9166678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9166946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9167058Z module_map=module_map) 2025-05-07T20:32:44.9167219Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9167321Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9167417Z E ^ 2025-05-07T20:32:44.9167811Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9167815Z 2025-05-07T20:32:44.9168237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9168241Z 2025-05-07T20:32:44.9168346Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9168614Z self=, 2025-05-07T20:32:44.9168700Z T=2048, 2025-05-07T20:32:44.9168777Z D=5120, 2025-05-07T20:32:44.9168907Z scale_ub=1200.0, 2025-05-07T20:32:44.9168995Z contiguous=False, 2025-05-07T20:32:44.9169080Z compiled=True, 2025-05-07T20:32:44.9169160Z ) 2025-05-07T20:32:44.9169378Z self = 2025-05-07T20:32:44.9169552Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9169556Z 2025-05-07T20:32:44.9169643Z @given( 2025-05-07T20:32:44.9169764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9169864Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9169989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9170106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9170267Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9170344Z ) 2025-05-07T20:32:44.9170590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9170736Z def test_silu_mul_quant( 2025-05-07T20:32:44.9170816Z self, 2025-05-07T20:32:44.9170892Z T: int, 2025-05-07T20:32:44.9170975Z D: int, 2025-05-07T20:32:44.9171075Z scale_ub: Optional[float], 2025-05-07T20:32:44.9171163Z contiguous: bool, 2025-05-07T20:32:44.9171255Z compiled: bool, 2025-05-07T20:32:44.9171334Z ) -> None: 2025-05-07T20:32:44.9171429Z torch.manual_seed(2025) 2025-05-07T20:32:44.9171508Z 2025-05-07T20:32:44.9171676Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9171757Z 2025-05-07T20:32:44.9171849Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9171973Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9172071Z x = x_sign * x_clamp 2025-05-07T20:32:44.9172156Z x0 = x[:, :D] 2025-05-07T20:32:44.9172236Z x1 = x[:, D:] 2025-05-07T20:32:44.9172317Z 2025-05-07T20:32:44.9172403Z if contiguous: 2025-05-07T20:32:44.9172497Z x0 = x0.contiguous() 2025-05-07T20:32:44.9172597Z x1 = x1.contiguous() 2025-05-07T20:32:44.9172668Z 2025-05-07T20:32:44.9172758Z if scale_ub is not None: 2025-05-07T20:32:44.9172873Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9173007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9173081Z ) 2025-05-07T20:32:44.9173165Z else: 2025-05-07T20:32:44.9173258Z scale_ub_tensor = None 2025-05-07T20:32:44.9173336Z 2025-05-07T20:32:44.9173465Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9173554Z op = silu_mul_quant 2025-05-07T20:32:44.9173646Z if compiled: 2025-05-07T20:32:44.9173745Z op = torch.compile(op) 2025-05-07T20:32:44.9173854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9173934Z 2025-05-07T20:32:44.9174024Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9174029Z 2025-05-07T20:32:44.9174130Z moe/activation_test.py:117: 2025-05-07T20:32:44.9174263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9174363Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9174474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9174840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9174932Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9175431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9175526Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9175882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9176167Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9176546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9176646Z kernel = self.compile( 2025-05-07T20:32:44.9177026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9177200Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9177333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9177337Z 2025-05-07T20:32:44.9177539Z self = 2025-05-07T20:32:44.9178327Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9178969Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96558145e0>} 2025-05-07T20:32:44.9179722Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9179917Z context = 2025-05-07T20:32:44.9179921Z 2025-05-07T20:32:44.9180085Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9180355Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9180462Z module_map=module_map) 2025-05-07T20:32:44.9180631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9180739Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9180816Z E ^ 2025-05-07T20:32:44.9181182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9181187Z 2025-05-07T20:32:44.9181603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9181607Z 2025-05-07T20:32:44.9181713Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9181945Z self=, 2025-05-07T20:32:44.9182022Z T=4096, 2025-05-07T20:32:44.9182098Z D=5120, 2025-05-07T20:32:44.9182190Z scale_ub=1200.0, 2025-05-07T20:32:44.9182274Z contiguous=True, 2025-05-07T20:32:44.9182361Z compiled=True, 2025-05-07T20:32:44.9182434Z ) 2025-05-07T20:32:44.9182651Z self = 2025-05-07T20:32:44.9182832Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9182837Z 2025-05-07T20:32:44.9182914Z @given( 2025-05-07T20:32:44.9183036Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9183141Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9183255Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9183371Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9183492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9187657Z ) 2025-05-07T20:32:44.9187932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9188030Z def test_silu_mul_quant( 2025-05-07T20:32:44.9188117Z self, 2025-05-07T20:32:44.9188195Z T: int, 2025-05-07T20:32:44.9188273Z D: int, 2025-05-07T20:32:44.9188384Z scale_ub: Optional[float], 2025-05-07T20:32:44.9188483Z contiguous: bool, 2025-05-07T20:32:44.9188665Z compiled: bool, 2025-05-07T20:32:44.9188748Z ) -> None: 2025-05-07T20:32:44.9188845Z torch.manual_seed(2025) 2025-05-07T20:32:44.9188931Z 2025-05-07T20:32:44.9189271Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9189350Z 2025-05-07T20:32:44.9189453Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9189579Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9189671Z x = x_sign * x_clamp 2025-05-07T20:32:44.9189762Z x0 = x[:, :D] 2025-05-07T20:32:44.9189843Z x1 = x[:, D:] 2025-05-07T20:32:44.9189917Z 2025-05-07T20:32:44.9190010Z if contiguous: 2025-05-07T20:32:44.9190103Z x0 = x0.contiguous() 2025-05-07T20:32:44.9190194Z x1 = x1.contiguous() 2025-05-07T20:32:44.9190276Z 2025-05-07T20:32:44.9190370Z if scale_ub is not None: 2025-05-07T20:32:44.9190485Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9190669Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9190749Z ) 2025-05-07T20:32:44.9190834Z else: 2025-05-07T20:32:44.9190971Z scale_ub_tensor = None 2025-05-07T20:32:44.9191048Z 2025-05-07T20:32:44.9191190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9191284Z op = silu_mul_quant 2025-05-07T20:32:44.9191370Z if compiled: 2025-05-07T20:32:44.9191480Z op = torch.compile(op) 2025-05-07T20:32:44.9191586Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9191659Z 2025-05-07T20:32:44.9191760Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9191765Z 2025-05-07T20:32:44.9191865Z moe/activation_test.py:117: 2025-05-07T20:32:44.9192006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9192111Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9192217Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9192599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9192695Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9193194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9193300Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9193660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9193890Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9194230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9194324Z kernel = self.compile( 2025-05-07T20:32:44.9194718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9194897Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9195038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9195044Z 2025-05-07T20:32:44.9195250Z self = 2025-05-07T20:32:44.9196036Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9196548Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655815120>} 2025-05-07T20:32:44.9197299Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9197576Z context = 2025-05-07T20:32:44.9197581Z 2025-05-07T20:32:44.9197811Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9198077Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9198195Z module_map=module_map) 2025-05-07T20:32:44.9198358Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9198465Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9198543Z E ^ 2025-05-07T20:32:44.9198902Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9198907Z 2025-05-07T20:32:44.9199333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9199510Z 2025-05-07T20:32:44.9199615Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9199845Z self=, 2025-05-07T20:32:44.9199963Z T=128, 2025-05-07T20:32:44.9200044Z D=5120, 2025-05-07T20:32:44.9200135Z scale_ub=1200.0, 2025-05-07T20:32:44.9200222Z contiguous=False, 2025-05-07T20:32:44.9200305Z compiled=True, 2025-05-07T20:32:44.9200385Z ) 2025-05-07T20:32:44.9200604Z self = 2025-05-07T20:32:44.9200776Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9200780Z 2025-05-07T20:32:44.9200869Z @given( 2025-05-07T20:32:44.9200986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9201086Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9201206Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9201326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9201447Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9201521Z ) 2025-05-07T20:32:44.9201773Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9201874Z def test_silu_mul_quant( 2025-05-07T20:32:44.9201951Z self, 2025-05-07T20:32:44.9202028Z T: int, 2025-05-07T20:32:44.9202113Z D: int, 2025-05-07T20:32:44.9202211Z scale_ub: Optional[float], 2025-05-07T20:32:44.9202300Z contiguous: bool, 2025-05-07T20:32:44.9202394Z compiled: bool, 2025-05-07T20:32:44.9202473Z ) -> None: 2025-05-07T20:32:44.9202568Z torch.manual_seed(2025) 2025-05-07T20:32:44.9202648Z 2025-05-07T20:32:44.9202816Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9202898Z 2025-05-07T20:32:44.9202992Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9203117Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9203220Z x = x_sign * x_clamp 2025-05-07T20:32:44.9203303Z x0 = x[:, :D] 2025-05-07T20:32:44.9203387Z x1 = x[:, D:] 2025-05-07T20:32:44.9203468Z 2025-05-07T20:32:44.9203559Z if contiguous: 2025-05-07T20:32:44.9203652Z x0 = x0.contiguous() 2025-05-07T20:32:44.9203755Z x1 = x1.contiguous() 2025-05-07T20:32:44.9203828Z 2025-05-07T20:32:44.9203919Z if scale_ub is not None: 2025-05-07T20:32:44.9204038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9204175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9204262Z ) 2025-05-07T20:32:44.9204339Z else: 2025-05-07T20:32:44.9204436Z scale_ub_tensor = None 2025-05-07T20:32:44.9204517Z 2025-05-07T20:32:44.9204647Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9204739Z op = silu_mul_quant 2025-05-07T20:32:44.9204840Z if compiled: 2025-05-07T20:32:44.9204992Z op = torch.compile(op) 2025-05-07T20:32:44.9205099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9205182Z 2025-05-07T20:32:44.9205276Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9205319Z 2025-05-07T20:32:44.9205421Z moe/activation_test.py:117: 2025-05-07T20:32:44.9205557Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9205660Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9205769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9206141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9206237Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9206744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9206862Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9207295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9207561Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9207906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9208010Z kernel = self.compile( 2025-05-07T20:32:44.9208393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9208567Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9208707Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9208711Z 2025-05-07T20:32:44.9208916Z self = 2025-05-07T20:32:44.9209709Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9210222Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655816340>} 2025-05-07T20:32:44.9210980Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9211171Z context = 2025-05-07T20:32:44.9211175Z 2025-05-07T20:32:44.9211340Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9211609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9211720Z module_map=module_map) 2025-05-07T20:32:44.9211887Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9211994Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9212074Z E ^ 2025-05-07T20:32:44.9212444Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9212448Z 2025-05-07T20:32:44.9212866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9212870Z 2025-05-07T20:32:44.9212975Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9213208Z self=, 2025-05-07T20:32:44.9213287Z T=16384, 2025-05-07T20:32:44.9213374Z D=7168, 2025-05-07T20:32:44.9213459Z scale_ub=1200.0, 2025-05-07T20:32:44.9213545Z contiguous=True, 2025-05-07T20:32:44.9213639Z compiled=True, 2025-05-07T20:32:44.9213715Z ) 2025-05-07T20:32:44.9213979Z self = 2025-05-07T20:32:44.9214161Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9214168Z 2025-05-07T20:32:44.9214286Z @given( 2025-05-07T20:32:44.9214406Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9214516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9214631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9214759Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9214873Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9214949Z ) 2025-05-07T20:32:44.9215202Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9215297Z def test_silu_mul_quant( 2025-05-07T20:32:44.9215374Z self, 2025-05-07T20:32:44.9215458Z T: int, 2025-05-07T20:32:44.9215536Z D: int, 2025-05-07T20:32:44.9215679Z scale_ub: Optional[float], 2025-05-07T20:32:44.9215780Z contiguous: bool, 2025-05-07T20:32:44.9215868Z compiled: bool, 2025-05-07T20:32:44.9215947Z ) -> None: 2025-05-07T20:32:44.9216088Z torch.manual_seed(2025) 2025-05-07T20:32:44.9216165Z 2025-05-07T20:32:44.9216334Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9216415Z 2025-05-07T20:32:44.9216507Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9216639Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9216729Z x = x_sign * x_clamp 2025-05-07T20:32:44.9216810Z x0 = x[:, :D] 2025-05-07T20:32:44.9216898Z x1 = x[:, D:] 2025-05-07T20:32:44.9216971Z 2025-05-07T20:32:44.9217056Z if contiguous: 2025-05-07T20:32:44.9217154Z x0 = x0.contiguous() 2025-05-07T20:32:44.9217263Z x1 = x1.contiguous() 2025-05-07T20:32:44.9217343Z 2025-05-07T20:32:44.9217463Z if scale_ub is not None: 2025-05-07T20:32:44.9217577Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9217715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9217802Z ) 2025-05-07T20:32:44.9217884Z else: 2025-05-07T20:32:44.9217988Z scale_ub_tensor = None 2025-05-07T20:32:44.9218064Z 2025-05-07T20:32:44.9218194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9218292Z op = silu_mul_quant 2025-05-07T20:32:44.9218380Z if compiled: 2025-05-07T20:32:44.9218483Z op = torch.compile(op) 2025-05-07T20:32:44.9218596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9218671Z 2025-05-07T20:32:44.9218762Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9218767Z 2025-05-07T20:32:44.9218873Z moe/activation_test.py:117: 2025-05-07T20:32:44.9219005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9219122Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9219224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9219596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9219699Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9220193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9220290Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9220656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9220877Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9221226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9221322Z kernel = self.compile( 2025-05-07T20:32:44.9221707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9221937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9222115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9222119Z 2025-05-07T20:32:44.9222326Z self = 2025-05-07T20:32:44.9223124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9223629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655817c40>} 2025-05-07T20:32:44.9224394Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9224665Z context = 2025-05-07T20:32:44.9224676Z 2025-05-07T20:32:44.9224852Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9225116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9225228Z module_map=module_map) 2025-05-07T20:32:44.9225399Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9225500Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9225579Z E ^ 2025-05-07T20:32:44.9225946Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9225950Z 2025-05-07T20:32:44.9226369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9226378Z 2025-05-07T20:32:44.9226490Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9226716Z self=, 2025-05-07T20:32:44.9226794Z T=16384, 2025-05-07T20:32:44.9226878Z D=5120, 2025-05-07T20:32:44.9226962Z scale_ub=1200.0, 2025-05-07T20:32:44.9227049Z contiguous=True, 2025-05-07T20:32:44.9227142Z compiled=False, 2025-05-07T20:32:44.9227215Z ) 2025-05-07T20:32:44.9227441Z self = 2025-05-07T20:32:44.9227646Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9227654Z 2025-05-07T20:32:44.9227758Z @given( 2025-05-07T20:32:44.9227877Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9227977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9228100Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9228756Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9228915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9229001Z ) 2025-05-07T20:32:44.9229296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9229391Z def test_silu_mul_quant( 2025-05-07T20:32:44.9229475Z self, 2025-05-07T20:32:44.9229554Z T: int, 2025-05-07T20:32:44.9229643Z D: int, 2025-05-07T20:32:44.9229742Z scale_ub: Optional[float], 2025-05-07T20:32:44.9229833Z contiguous: bool, 2025-05-07T20:32:44.9229926Z compiled: bool, 2025-05-07T20:32:44.9230005Z ) -> None: 2025-05-07T20:32:44.9230100Z torch.manual_seed(2025) 2025-05-07T20:32:44.9230182Z 2025-05-07T20:32:44.9230352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9230426Z 2025-05-07T20:32:44.9230528Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9230824Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9230916Z x = x_sign * x_clamp 2025-05-07T20:32:44.9231001Z x0 = x[:, :D] 2025-05-07T20:32:44.9231084Z x1 = x[:, D:] 2025-05-07T20:32:44.9231254Z 2025-05-07T20:32:44.9231346Z if contiguous: 2025-05-07T20:32:44.9231437Z x0 = x0.contiguous() 2025-05-07T20:32:44.9231532Z x1 = x1.contiguous() 2025-05-07T20:32:44.9231605Z 2025-05-07T20:32:44.9231693Z if scale_ub is not None: 2025-05-07T20:32:44.9231804Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9231938Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9232013Z ) 2025-05-07T20:32:44.9232094Z else: 2025-05-07T20:32:44.9232186Z scale_ub_tensor = None 2025-05-07T20:32:44.9232257Z 2025-05-07T20:32:44.9232392Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9232551Z op = silu_mul_quant 2025-05-07T20:32:44.9232638Z if compiled: 2025-05-07T20:32:44.9232742Z op = torch.compile(op) 2025-05-07T20:32:44.9232915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9232997Z 2025-05-07T20:32:44.9233087Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9233092Z 2025-05-07T20:32:44.9233189Z moe/activation_test.py:117: 2025-05-07T20:32:44.9233323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9233422Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9233522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9234033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.9234128Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9234491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9234717Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9235058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9235159Z kernel = self.compile( 2025-05-07T20:32:44.9235540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9235711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9235848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9235853Z 2025-05-07T20:32:44.9236056Z self = 2025-05-07T20:32:44.9236845Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9237354Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655618ae0>} 2025-05-07T20:32:44.9238114Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9238301Z context = 2025-05-07T20:32:44.9238305Z 2025-05-07T20:32:44.9238467Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9238732Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9238838Z module_map=module_map) 2025-05-07T20:32:44.9239005Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9239151Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9239229Z E ^ 2025-05-07T20:32:44.9239594Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9239637Z 2025-05-07T20:32:44.9240056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9240061Z 2025-05-07T20:32:44.9240168Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9240396Z self=, 2025-05-07T20:32:44.9240473Z T=1, 2025-05-07T20:32:44.9240554Z D=7168, 2025-05-07T20:32:44.9240638Z scale_ub=1200.0, 2025-05-07T20:32:44.9240724Z contiguous=False, 2025-05-07T20:32:44.9240814Z compiled=False, 2025-05-07T20:32:44.9240886Z ) 2025-05-07T20:32:44.9241103Z self = 2025-05-07T20:32:44.9241320Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9241327Z 2025-05-07T20:32:44.9241403Z @given( 2025-05-07T20:32:44.9241559Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9241668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9241781Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9241904Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9242018Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9242090Z ) 2025-05-07T20:32:44.9242342Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9242436Z def test_silu_mul_quant( 2025-05-07T20:32:44.9242512Z self, 2025-05-07T20:32:44.9242596Z T: int, 2025-05-07T20:32:44.9242674Z D: int, 2025-05-07T20:32:44.9242771Z scale_ub: Optional[float], 2025-05-07T20:32:44.9242867Z contiguous: bool, 2025-05-07T20:32:44.9242959Z compiled: bool, 2025-05-07T20:32:44.9243044Z ) -> None: 2025-05-07T20:32:44.9243144Z torch.manual_seed(2025) 2025-05-07T20:32:44.9243218Z 2025-05-07T20:32:44.9243399Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9243471Z 2025-05-07T20:32:44.9243563Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9243696Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9243784Z x = x_sign * x_clamp 2025-05-07T20:32:44.9243864Z x0 = x[:, :D] 2025-05-07T20:32:44.9243951Z x1 = x[:, D:] 2025-05-07T20:32:44.9244024Z 2025-05-07T20:32:44.9244109Z if contiguous: 2025-05-07T20:32:44.9244207Z x0 = x0.contiguous() 2025-05-07T20:32:44.9244295Z x1 = x1.contiguous() 2025-05-07T20:32:44.9244370Z 2025-05-07T20:32:44.9244469Z if scale_ub is not None: 2025-05-07T20:32:44.9244576Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9244727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9244805Z ) 2025-05-07T20:32:44.9244883Z else: 2025-05-07T20:32:44.9244983Z scale_ub_tensor = None 2025-05-07T20:32:44.9245060Z 2025-05-07T20:32:44.9245192Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9245289Z op = silu_mul_quant 2025-05-07T20:32:44.9245375Z if compiled: 2025-05-07T20:32:44.9245475Z op = torch.compile(op) 2025-05-07T20:32:44.9245589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9245660Z 2025-05-07T20:32:44.9245749Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9245753Z 2025-05-07T20:32:44.9245854Z moe/activation_test.py:117: 2025-05-07T20:32:44.9245984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9246090Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9246189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9246690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9246847Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9247295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9247517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9247864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9247977Z kernel = self.compile( 2025-05-07T20:32:44.9248359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9248533Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9248670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9248715Z 2025-05-07T20:32:44.9248921Z self = 2025-05-07T20:32:44.9249745Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9250259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655618400>} 2025-05-07T20:32:44.9251010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9251208Z context = 2025-05-07T20:32:44.9251213Z 2025-05-07T20:32:44.9251378Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9251658Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9251767Z module_map=module_map) 2025-05-07T20:32:44.9251930Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9252035Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9252113Z E ^ 2025-05-07T20:32:44.9252471Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9252476Z 2025-05-07T20:32:44.9252899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9252904Z 2025-05-07T20:32:44.9253007Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9253236Z self=, 2025-05-07T20:32:44.9253316Z T=4096, 2025-05-07T20:32:44.9253394Z D=7168, 2025-05-07T20:32:44.9253484Z scale_ub=1200.0, 2025-05-07T20:32:44.9253570Z contiguous=False, 2025-05-07T20:32:44.9253655Z compiled=True, 2025-05-07T20:32:44.9253733Z ) 2025-05-07T20:32:44.9253956Z self = 2025-05-07T20:32:44.9254135Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9254146Z 2025-05-07T20:32:44.9254222Z @given( 2025-05-07T20:32:44.9254341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9254450Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9254563Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9254679Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9254799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9254872Z ) 2025-05-07T20:32:44.9255116Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9255269Z def test_silu_mul_quant( 2025-05-07T20:32:44.9255345Z self, 2025-05-07T20:32:44.9255427Z T: int, 2025-05-07T20:32:44.9255505Z D: int, 2025-05-07T20:32:44.9255603Z scale_ub: Optional[float], 2025-05-07T20:32:44.9255739Z contiguous: bool, 2025-05-07T20:32:44.9255825Z compiled: bool, 2025-05-07T20:32:44.9255901Z ) -> None: 2025-05-07T20:32:44.9256003Z torch.manual_seed(2025) 2025-05-07T20:32:44.9256075Z 2025-05-07T20:32:44.9256243Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9256321Z 2025-05-07T20:32:44.9256413Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9256536Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9256631Z x = x_sign * x_clamp 2025-05-07T20:32:44.9256710Z x0 = x[:, :D] 2025-05-07T20:32:44.9256790Z x1 = x[:, D:] 2025-05-07T20:32:44.9256868Z 2025-05-07T20:32:44.9256996Z if contiguous: 2025-05-07T20:32:44.9257098Z x0 = x0.contiguous() 2025-05-07T20:32:44.9257187Z x1 = x1.contiguous() 2025-05-07T20:32:44.9257259Z 2025-05-07T20:32:44.9257398Z if scale_ub is not None: 2025-05-07T20:32:44.9257518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9257671Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9257774Z ) 2025-05-07T20:32:44.9257850Z else: 2025-05-07T20:32:44.9257943Z scale_ub_tensor = None 2025-05-07T20:32:44.9258022Z 2025-05-07T20:32:44.9258150Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9258240Z op = silu_mul_quant 2025-05-07T20:32:44.9258333Z if compiled: 2025-05-07T20:32:44.9258432Z op = torch.compile(op) 2025-05-07T20:32:44.9258547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9258623Z 2025-05-07T20:32:44.9258716Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9258724Z 2025-05-07T20:32:44.9258828Z moe/activation_test.py:117: 2025-05-07T20:32:44.9258959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9259062Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9259169Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9259545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9259637Z return fn(*args, **kwargs) 
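
Every CompilationError in this run is the same compile-time rejection: Triton refuses to build IR for the fp8e4nv (e4m3) conversion that `_fbgemm_silu_mul_quant` requests, because the job's GPU (g5.4xlarge, NVIDIA A10G, compute capability sm_86) only offers the 'fp8e4b15' and 'fp8e5' encodings; fp8e4nv casts need sm_89 (Ada) or newer. A minimal sketch that reproduces the same ValueError on pre-sm_89 hardware — the kernel below is illustrative, not the FBGEMM kernel:

    # Hypothetical repro, assuming a recent Triton; only the fp8e4nv cast matters.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        # On sm_86 (e.g. A10G) this cast is rejected while lowering the AST to IR,
        # raising ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)

The rejection happens in make_ir, before any PTX is generated, which is why the eager path and the torch.compile path (the eval_frame.py frame in the traceback in progress here) die identically at Triton's compile step.
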
2025-05-07T20:32:44.9260135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9260237Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9260602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9260824Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9261167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9261269Z kernel = self.compile( 2025-05-07T20:32:44.9261654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9261832Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9261963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9261967Z 2025-05-07T20:32:44.9262172Z self = 2025-05-07T20:32:44.9262966Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9263467Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965561af20>} 2025-05-07T20:32:44.9264353Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9264544Z context = 2025-05-07T20:32:44.9264548Z 2025-05-07T20:32:44.9264715Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9264986Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9265094Z module_map=module_map) 2025-05-07T20:32:44.9265263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9265363Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9265440Z E ^ 2025-05-07T20:32:44.9265844Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9265851Z 2025-05-07T20:32:44.9266310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9266315Z 2025-05-07T20:32:44.9266425Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9266647Z self=, 2025-05-07T20:32:44.9266725Z T=128, 2025-05-07T20:32:44.9266809Z D=7168, 2025-05-07T20:32:44.9266892Z scale_ub=1200.0, 2025-05-07T20:32:44.9266980Z contiguous=False, 2025-05-07T20:32:44.9267073Z compiled=True, 2025-05-07T20:32:44.9267145Z ) 2025-05-07T20:32:44.9267362Z self = 2025-05-07T20:32:44.9267541Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9267545Z 2025-05-07T20:32:44.9267625Z @given( 2025-05-07T20:32:44.9267754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9267854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9267970Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9268097Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9268210Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9268284Z ) 2025-05-07T20:32:44.9268537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9268631Z def test_silu_mul_quant( 2025-05-07T20:32:44.9268707Z self, 2025-05-07T20:32:44.9268793Z T: int, 2025-05-07T20:32:44.9268871Z D: int, 2025-05-07T20:32:44.9268967Z scale_ub: Optional[float], 2025-05-07T20:32:44.9269145Z contiguous: bool, 2025-05-07T20:32:44.9269230Z compiled: bool, 2025-05-07T20:32:44.9269315Z ) -> None: 2025-05-07T20:32:44.9269410Z torch.manual_seed(2025) 2025-05-07T20:32:44.9269486Z 2025-05-07T20:32:44.9269660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9269736Z 2025-05-07T20:32:44.9269829Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9269966Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9270055Z x = x_sign * x_clamp 2025-05-07T20:32:44.9270134Z x0 = x[:, :D] 2025-05-07T20:32:44.9270219Z x1 = x[:, D:] 2025-05-07T20:32:44.9270291Z 2025-05-07T20:32:44.9270373Z if contiguous: 2025-05-07T20:32:44.9270469Z x0 = x0.contiguous() 2025-05-07T20:32:44.9270557Z x1 = x1.contiguous() 2025-05-07T20:32:44.9270633Z 2025-05-07T20:32:44.9270723Z if scale_ub is not None: 2025-05-07T20:32:44.9270828Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9270967Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9271042Z ) 2025-05-07T20:32:44.9271120Z else: 2025-05-07T20:32:44.9271271Z scale_ub_tensor = None 2025-05-07T20:32:44.9271344Z 2025-05-07T20:32:44.9271471Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9271569Z op = silu_mul_quant 2025-05-07T20:32:44.9271698Z if compiled: 2025-05-07T20:32:44.9271800Z op = torch.compile(op) 2025-05-07T20:32:44.9271910Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9271981Z 2025-05-07T20:32:44.9272077Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9272082Z 2025-05-07T20:32:44.9272179Z moe/activation_test.py:117: 2025-05-07T20:32:44.9272308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9272413Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9272514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9272883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9273024Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9273517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9273657Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9274016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9274237Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9274581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9274673Z kernel = self.compile( 2025-05-07T20:32:44.9275053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9275231Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9275360Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9275367Z 2025-05-07T20:32:44.9275576Z self = 2025-05-07T20:32:44.9276362Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9276867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655d14220>} 2025-05-07T20:32:44.9277675Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9277863Z context = 2025-05-07T20:32:44.9277871Z 2025-05-07T20:32:44.9278048Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9278312Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9278426Z module_map=module_map) 2025-05-07T20:32:44.9278586Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9278684Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9278770Z E ^ 2025-05-07T20:32:44.9279129Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9279134Z 2025-05-07T20:32:44.9279550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9279555Z 2025-05-07T20:32:44.9279663Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9279887Z self=, 2025-05-07T20:32:44.9280017Z T=2048, 2025-05-07T20:32:44.9280092Z D=7168, 2025-05-07T20:32:44.9280173Z scale_ub=None, 2025-05-07T20:32:44.9280265Z contiguous=True, 2025-05-07T20:32:44.9280349Z compiled=True, 2025-05-07T20:32:44.9280462Z ) 2025-05-07T20:32:44.9280692Z self = 2025-05-07T20:32:44.9280862Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.9280866Z 2025-05-07T20:32:44.9280943Z @given( 2025-05-07T20:32:44.9281067Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9281167Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9281289Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9281407Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9281519Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9281599Z ) 2025-05-07T20:32:44.9281889Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9281985Z def test_silu_mul_quant( 2025-05-07T20:32:44.9282067Z self, 2025-05-07T20:32:44.9282183Z T: int, 2025-05-07T20:32:44.9282263Z D: int, 2025-05-07T20:32:44.9282368Z scale_ub: Optional[float], 2025-05-07T20:32:44.9282461Z contiguous: bool, 2025-05-07T20:32:44.9282550Z compiled: bool, 2025-05-07T20:32:44.9282635Z ) -> None: 2025-05-07T20:32:44.9282731Z torch.manual_seed(2025) 2025-05-07T20:32:44.9282808Z 2025-05-07T20:32:44.9282977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9283048Z 2025-05-07T20:32:44.9283147Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9283276Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9283365Z x = x_sign * x_clamp 2025-05-07T20:32:44.9283455Z x0 = x[:, :D] 2025-05-07T20:32:44.9283535Z x1 = x[:, D:] 2025-05-07T20:32:44.9283612Z 2025-05-07T20:32:44.9283705Z if contiguous: 2025-05-07T20:32:44.9283797Z x0 = x0.contiguous() 2025-05-07T20:32:44.9283886Z x1 = x1.contiguous() 2025-05-07T20:32:44.9283965Z 2025-05-07T20:32:44.9284059Z if scale_ub is not None: 2025-05-07T20:32:44.9284170Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9284305Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9284381Z ) 2025-05-07T20:32:44.9284462Z else: 2025-05-07T20:32:44.9284557Z scale_ub_tensor = None 2025-05-07T20:32:44.9284629Z 2025-05-07T20:32:44.9284763Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9284852Z op = silu_mul_quant 2025-05-07T20:32:44.9284936Z if compiled: 2025-05-07T20:32:44.9285042Z op = torch.compile(op) 2025-05-07T20:32:44.9285147Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9285221Z 2025-05-07T20:32:44.9285321Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9285326Z 2025-05-07T20:32:44.9285424Z moe/activation_test.py:117: 2025-05-07T20:32:44.9285562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9285665Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9285764Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9286137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9286229Z return fn(*args, **kwargs) 
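
Since this is an architecture gap rather than a kernel bug, the usual mitigation is to gate the FP8 tests on compute capability. A sketch assuming plain unittest; the class and helper names are illustrative, not the suite's actual names:

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (e4m3) path needs compute capability >= (8, 9)
        # (Ada/Hopper); the A10G in this job reports (8, 6).
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(
        not torch.cuda.is_available() or not _supports_fp8e4nv(),
        "fp8e4nv unsupported: this GPU only offers fp8e4b15/fp8e5",
    )
    class SiluMulQuantTests(unittest.TestCase):
        ...

Hypothesis honors unittest's skip machinery, so a class-level guard like this short-circuits before any example is drawn.
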
2025-05-07T20:32:44.9286723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9286824Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9287180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9287410Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9287849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9287946Z kernel = self.compile( 2025-05-07T20:32:44.9288377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9288551Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9288680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9288691Z 2025-05-07T20:32:44.9288895Z self = 2025-05-07T20:32:44.9289676Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9290229Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655d14d60>} 2025-05-07T20:32:44.9291046Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9291243Z context = 2025-05-07T20:32:44.9291247Z 2025-05-07T20:32:44.9291410Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9291672Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9291782Z module_map=module_map) 2025-05-07T20:32:44.9291941Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9292044Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9292126Z E ^ 2025-05-07T20:32:44.9292483Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9292487Z 2025-05-07T20:32:44.9292914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9292918Z 2025-05-07T20:32:44.9293020Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9293242Z self=, 2025-05-07T20:32:44.9293324Z T=16384, 2025-05-07T20:32:44.9293400Z D=5120, 2025-05-07T20:32:44.9293487Z scale_ub=None, 2025-05-07T20:32:44.9293573Z contiguous=False, 2025-05-07T20:32:44.9293657Z compiled=False, 2025-05-07T20:32:44.9293735Z ) 2025-05-07T20:32:44.9293954Z self = 2025-05-07T20:32:44.9294130Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9294138Z 2025-05-07T20:32:44.9294222Z @given( 2025-05-07T20:32:44.9294338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9294439Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9294563Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9294680Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9294799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9294871Z ) 2025-05-07T20:32:44.9295118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9295218Z def test_silu_mul_quant( 2025-05-07T20:32:44.9295294Z self, 2025-05-07T20:32:44.9295370Z T: int, 2025-05-07T20:32:44.9295452Z D: int, 2025-05-07T20:32:44.9295550Z scale_ub: Optional[float], 2025-05-07T20:32:44.9295638Z contiguous: bool, 2025-05-07T20:32:44.9295729Z compiled: bool, 2025-05-07T20:32:44.9295808Z ) -> None: 2025-05-07T20:32:44.9296003Z torch.manual_seed(2025) 2025-05-07T20:32:44.9296082Z 2025-05-07T20:32:44.9296250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9296333Z 2025-05-07T20:32:44.9296464Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9296588Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9298411Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9298458Z 2025-05-07T20:32:44.9298580Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9298584Z 2025-05-07T20:32:44.9298692Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9298957Z self=, 2025-05-07T20:32:44.9299035Z T=4096, 2025-05-07T20:32:44.9299119Z D=7168, 2025-05-07T20:32:44.9299200Z scale_ub=1200.0, 2025-05-07T20:32:44.9299286Z contiguous=True, 2025-05-07T20:32:44.9299374Z compiled=True, 2025-05-07T20:32:44.9299447Z ) 2025-05-07T20:32:44.9299672Z self = 2025-05-07T20:32:44.9299846Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9299851Z 2025-05-07T20:32:44.9299926Z @given( 2025-05-07T20:32:44.9300051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9300148Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9300261Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9300389Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9300501Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9300574Z ) 2025-05-07T20:32:44.9300832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9300927Z def test_silu_mul_quant( 2025-05-07T20:32:44.9301007Z self, 2025-05-07T20:32:44.9301082Z T: int, 2025-05-07T20:32:44.9301157Z D: int, 2025-05-07T20:32:44.9301258Z scale_ub: Optional[float], 2025-05-07T20:32:44.9301346Z contiguous: bool, 2025-05-07T20:32:44.9301429Z compiled: bool, 2025-05-07T20:32:44.9301513Z ) -> None: 2025-05-07T20:32:44.9301606Z torch.manual_seed(2025) 2025-05-07T20:32:44.9301678Z 2025-05-07T20:32:44.9301853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9301924Z 2025-05-07T20:32:44.9302017Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9302147Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9303940Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
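
The OOM sizes line up exactly with the test's [T, 2*D] bfloat16 intermediates: each sign/abs/clamp/mul step materializes another tensor of that shape. For the T=4096, D=7168 example above:

    # 112 MiB = one [T, 2*D] bf16 tensor at T=4096, D=7168:
    >>> 4096 * (2 * 7168) * 2 / 2**20   # rows * cols * bytes-per-bf16
    112.0

The 448 MiB and 320 MiB failures match the same formula at T=16384 with D=7168 and D=5120 respectively.
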
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9303952Z 2025-05-07T20:32:44.9304068Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9304072Z 2025-05-07T20:32:44.9304174Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9304399Z self=, 2025-05-07T20:32:44.9304477Z T=16384, 2025-05-07T20:32:44.9304603Z D=7168, 2025-05-07T20:32:44.9304689Z scale_ub=None, 2025-05-07T20:32:44.9304774Z contiguous=False, 2025-05-07T20:32:44.9304858Z compiled=False, 2025-05-07T20:32:44.9304937Z ) 2025-05-07T20:32:44.9305202Z self = 2025-05-07T20:32:44.9305382Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9305392Z 2025-05-07T20:32:44.9305468Z @given( 2025-05-07T20:32:44.9305586Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9305691Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9305804Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9305919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9306037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9306110Z ) 2025-05-07T20:32:44.9306354Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9306501Z def test_silu_mul_quant( 2025-05-07T20:32:44.9306576Z self, 2025-05-07T20:32:44.9306656Z T: int, 2025-05-07T20:32:44.9306770Z D: int, 2025-05-07T20:32:44.9306876Z scale_ub: Optional[float], 2025-05-07T20:32:44.9306970Z contiguous: bool, 2025-05-07T20:32:44.9307074Z compiled: bool, 2025-05-07T20:32:44.9307156Z ) -> None: 2025-05-07T20:32:44.9307278Z torch.manual_seed(2025) 2025-05-07T20:32:44.9307351Z 2025-05-07T20:32:44.9307519Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9309412Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9309425Z 2025-05-07T20:32:44.9309544Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9309549Z 2025-05-07T20:32:44.9309659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9309879Z self=, 2025-05-07T20:32:44.9309961Z T=2048, 2025-05-07T20:32:44.9310036Z D=7168, 2025-05-07T20:32:44.9310117Z scale_ub=1200.0, 2025-05-07T20:32:44.9310206Z contiguous=True, 2025-05-07T20:32:44.9310290Z compiled=True, 2025-05-07T20:32:44.9310362Z ) 2025-05-07T20:32:44.9310585Z self = 2025-05-07T20:32:44.9310755Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9310763Z 2025-05-07T20:32:44.9314914Z @given( 2025-05-07T20:32:44.9315057Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9315171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9315296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9315413Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9315536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9315612Z ) 2025-05-07T20:32:44.9315864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9315970Z def test_silu_mul_quant( 2025-05-07T20:32:44.9316047Z self, 2025-05-07T20:32:44.9316126Z T: int, 2025-05-07T20:32:44.9316216Z D: int, 2025-05-07T20:32:44.9316315Z scale_ub: Optional[float], 2025-05-07T20:32:44.9316415Z contiguous: bool, 2025-05-07T20:32:44.9316501Z compiled: bool, 2025-05-07T20:32:44.9316584Z ) -> None: 2025-05-07T20:32:44.9316692Z torch.manual_seed(2025) 2025-05-07T20:32:44.9316853Z 2025-05-07T20:32:44.9317026Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9317109Z 2025-05-07T20:32:44.9317253Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9317389Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9319235Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9319241Z 2025-05-07T20:32:44.9319406Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9319412Z 2025-05-07T20:32:44.9319524Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9319787Z self=, 2025-05-07T20:32:44.9319880Z T=2048, 2025-05-07T20:32:44.9319957Z D=7168, 2025-05-07T20:32:44.9320041Z scale_ub=None, 2025-05-07T20:32:44.9320136Z contiguous=True, 2025-05-07T20:32:44.9320222Z compiled=False, 2025-05-07T20:32:44.9320297Z ) 2025-05-07T20:32:44.9320525Z self = 2025-05-07T20:32:44.9320698Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9320702Z 2025-05-07T20:32:44.9320780Z @given( 2025-05-07T20:32:44.9320906Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9321005Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9321128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9321251Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9321364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9321447Z ) 2025-05-07T20:32:44.9321697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9321792Z def test_silu_mul_quant( 2025-05-07T20:32:44.9321878Z self, 2025-05-07T20:32:44.9321957Z T: int, 2025-05-07T20:32:44.9322033Z D: int, 2025-05-07T20:32:44.9322140Z scale_ub: Optional[float], 2025-05-07T20:32:44.9322230Z contiguous: bool, 2025-05-07T20:32:44.9322317Z compiled: bool, 2025-05-07T20:32:44.9322402Z ) -> None: 2025-05-07T20:32:44.9322498Z torch.manual_seed(2025) 2025-05-07T20:32:44.9322580Z 2025-05-07T20:32:44.9322748Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9322823Z 2025-05-07T20:32:44.9322924Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.9324724Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
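
Free memory also ratchets down across examples (140.44 MiB, then 28.44 MiB, then 26.44 MiB), so tensors from earlier failed examples are evidently still alive when the next one starts. Because Hypothesis drives every example inside a single test-method call, per-example cleanup has to happen in the test body rather than in tearDown. A sketch of that shape (the structure is illustrative, not the suite's actual fix):

    import gc
    import torch

    def run_one_example(T: int, D: int) -> None:
        x = None
        try:
            x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
            ...  # the rest of the example body from the test above
        finally:
            x = None                   # drop the large bf16 intermediate
            gc.collect()               # reap anything the failed example left behind
            torch.cuda.empty_cache()   # return cached blocks to the driver
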
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9324731Z 2025-05-07T20:32:44.9324856Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.9324860Z 2025-05-07T20:32:44.9324964Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9325187Z self=, 2025-05-07T20:32:44.9325275Z T=1, 2025-05-07T20:32:44.9325352Z D=7168, 2025-05-07T20:32:44.9325437Z scale_ub=1200.0, 2025-05-07T20:32:44.9325580Z contiguous=True, 2025-05-07T20:32:44.9325665Z compiled=False, 2025-05-07T20:32:44.9325747Z ) 2025-05-07T20:32:44.9325975Z self = 2025-05-07T20:32:44.9326183Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9326189Z 2025-05-07T20:32:44.9326276Z @given( 2025-05-07T20:32:44.9326394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9326492Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9326615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9326731Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9326844Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9326924Z ) 2025-05-07T20:32:44.9327169Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9327269Z def test_silu_mul_quant( 2025-05-07T20:32:44.9327427Z self, 2025-05-07T20:32:44.9327506Z T: int, 2025-05-07T20:32:44.9327591Z D: int, 2025-05-07T20:32:44.9327689Z scale_ub: Optional[float], 2025-05-07T20:32:44.9327820Z contiguous: bool, 2025-05-07T20:32:44.9327919Z compiled: bool, 2025-05-07T20:32:44.9327998Z ) -> None: 2025-05-07T20:32:44.9328094Z torch.manual_seed(2025) 2025-05-07T20:32:44.9328624Z 2025-05-07T20:32:44.9328858Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9328936Z 2025-05-07T20:32:44.9329040Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9329166Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9329267Z x = x_sign * x_clamp 2025-05-07T20:32:44.9329353Z x0 = x[:, :D] 2025-05-07T20:32:44.9329436Z x1 = x[:, D:] 2025-05-07T20:32:44.9329518Z 2025-05-07T20:32:44.9329606Z if contiguous: 2025-05-07T20:32:44.9329702Z x0 = x0.contiguous() 2025-05-07T20:32:44.9329804Z x1 = x1.contiguous() 2025-05-07T20:32:44.9329883Z 2025-05-07T20:32:44.9329978Z if scale_ub is not None: 2025-05-07T20:32:44.9330098Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9330243Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9330322Z ) 2025-05-07T20:32:44.9330409Z else: 2025-05-07T20:32:44.9330509Z scale_ub_tensor = None 2025-05-07T20:32:44.9330583Z 2025-05-07T20:32:44.9330725Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9330819Z op = silu_mul_quant 2025-05-07T20:32:44.9330916Z if compiled: 2025-05-07T20:32:44.9331017Z op = torch.compile(op) 2025-05-07T20:32:44.9331124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9331203Z 2025-05-07T20:32:44.9331295Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9331300Z 2025-05-07T20:32:44.9331400Z moe/activation_test.py:117: 2025-05-07T20:32:44.9331548Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9331651Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9331756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9332270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9332368Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9332736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9332961Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9333303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9333407Z kernel = self.compile( 2025-05-07T20:32:44.9333791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9334142Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9334274Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9334441Z 2025-05-07T20:32:44.9334652Z self = 2025-05-07T20:32:44.9335447Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9335951Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556c540>} 2025-05-07T20:32:44.9336750Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9337020Z context = 2025-05-07T20:32:44.9337025Z 2025-05-07T20:32:44.9337254Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9337525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9337634Z module_map=module_map) 2025-05-07T20:32:44.9337809Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9337910Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9337993Z E ^ 2025-05-07T20:32:44.9338355Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9338359Z 2025-05-07T20:32:44.9338776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9338785Z 2025-05-07T20:32:44.9338899Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9339123Z self=, 2025-05-07T20:32:44.9339206Z T=128, 2025-05-07T20:32:44.9339296Z D=5120, 2025-05-07T20:32:44.9339384Z scale_ub=None, 2025-05-07T20:32:44.9339472Z contiguous=True, 2025-05-07T20:32:44.9339568Z compiled=False, 2025-05-07T20:32:44.9339644Z ) 2025-05-07T20:32:44.9339863Z self = 2025-05-07T20:32:44.9340042Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9340046Z 2025-05-07T20:32:44.9340124Z @given( 2025-05-07T20:32:44.9340251Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9340351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9340468Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9340601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9340718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9340794Z ) 2025-05-07T20:32:44.9341048Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9341147Z def test_silu_mul_quant( 2025-05-07T20:32:44.9341226Z self, 2025-05-07T20:32:44.9341313Z T: int, 2025-05-07T20:32:44.9341392Z D: int, 2025-05-07T20:32:44.9341492Z scale_ub: Optional[float], 2025-05-07T20:32:44.9341592Z contiguous: bool, 2025-05-07T20:32:44.9341679Z compiled: bool, 2025-05-07T20:32:44.9341766Z ) -> None: 2025-05-07T20:32:44.9341863Z torch.manual_seed(2025) 2025-05-07T20:32:44.9341937Z 2025-05-07T20:32:44.9342110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9342188Z 2025-05-07T20:32:44.9342281Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9342413Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9342553Z x = x_sign * x_clamp 2025-05-07T20:32:44.9342637Z x0 = x[:, :D] 2025-05-07T20:32:44.9342726Z x1 = x[:, D:] 2025-05-07T20:32:44.9342803Z 2025-05-07T20:32:44.9342888Z if contiguous: 2025-05-07T20:32:44.9343027Z x0 = x0.contiguous() 2025-05-07T20:32:44.9343118Z x1 = x1.contiguous() 2025-05-07T20:32:44.9343201Z 2025-05-07T20:32:44.9343291Z if scale_ub is not None: 2025-05-07T20:32:44.9343398Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9343539Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9343615Z ) 2025-05-07T20:32:44.9343693Z else: 2025-05-07T20:32:44.9343793Z scale_ub_tensor = None 2025-05-07T20:32:44.9343867Z 2025-05-07T20:32:44.9343996Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9344094Z op = silu_mul_quant 2025-05-07T20:32:44.9344224Z if compiled: 2025-05-07T20:32:44.9344329Z op = torch.compile(op) 2025-05-07T20:32:44.9344441Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9344516Z 2025-05-07T20:32:44.9344652Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9344660Z 2025-05-07T20:32:44.9344760Z moe/activation_test.py:117: 2025-05-07T20:32:44.9344893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9345004Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9345106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9345606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9345713Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9346072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9346307Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9346653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9346755Z kernel = self.compile( 2025-05-07T20:32:44.9347199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9347374Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9347502Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9347517Z 2025-05-07T20:32:44.9347722Z self = 2025-05-07T20:32:44.9348508Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9349021Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556d620>} 2025-05-07T20:32:44.9349850Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9350049Z context = 2025-05-07T20:32:44.9350054Z 2025-05-07T20:32:44.9350219Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9350483Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9350602Z module_map=module_map) 2025-05-07T20:32:44.9350764Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9350864Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9350954Z E ^ 2025-05-07T20:32:44.9351356Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9351361Z 2025-05-07T20:32:44.9351825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9351829Z 2025-05-07T20:32:44.9351936Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9352159Z self=, 2025-05-07T20:32:44.9352248Z T=128, 2025-05-07T20:32:44.9352327Z D=7168, 2025-05-07T20:32:44.9352420Z scale_ub=None, 2025-05-07T20:32:44.9352508Z contiguous=True, 2025-05-07T20:32:44.9352595Z compiled=False, 2025-05-07T20:32:44.9352680Z ) 2025-05-07T20:32:44.9352898Z self = 2025-05-07T20:32:44.9353068Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9353111Z 2025-05-07T20:32:44.9353201Z @given( 2025-05-07T20:32:44.9353319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9353419Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9353585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9353703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9353829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9353905Z ) 2025-05-07T20:32:44.9354151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9354255Z def test_silu_mul_quant( 2025-05-07T20:32:44.9354333Z self, 2025-05-07T20:32:44.9354412Z T: int, 2025-05-07T20:32:44.9354497Z D: int, 2025-05-07T20:32:44.9354596Z scale_ub: Optional[float], 2025-05-07T20:32:44.9354691Z contiguous: bool, 2025-05-07T20:32:44.9354777Z compiled: bool, 2025-05-07T20:32:44.9354855Z ) -> None: 2025-05-07T20:32:44.9354958Z torch.manual_seed(2025) 2025-05-07T20:32:44.9355035Z 2025-05-07T20:32:44.9355203Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9355283Z 2025-05-07T20:32:44.9355383Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9355507Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9355603Z x = x_sign * x_clamp 2025-05-07T20:32:44.9355685Z x0 = x[:, :D] 2025-05-07T20:32:44.9355771Z x1 = x[:, D:] 2025-05-07T20:32:44.9355844Z 2025-05-07T20:32:44.9355932Z if contiguous: 2025-05-07T20:32:44.9356031Z x0 = x0.contiguous() 2025-05-07T20:32:44.9356119Z x1 = x1.contiguous() 2025-05-07T20:32:44.9356193Z 2025-05-07T20:32:44.9356291Z if scale_ub is not None: 2025-05-07T20:32:44.9356396Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9356532Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9356618Z ) 2025-05-07T20:32:44.9356700Z else: 2025-05-07T20:32:44.9356796Z scale_ub_tensor = None 2025-05-07T20:32:44.9356874Z 2025-05-07T20:32:44.9357006Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9357098Z op = silu_mul_quant 2025-05-07T20:32:44.9357191Z if compiled: 2025-05-07T20:32:44.9357293Z op = torch.compile(op) 2025-05-07T20:32:44.9357404Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9357478Z 2025-05-07T20:32:44.9357569Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9357574Z 2025-05-07T20:32:44.9357678Z moe/activation_test.py:117: 2025-05-07T20:32:44.9357805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9357907Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9358016Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9358514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9358669Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9359028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9359324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9359670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9359763Z kernel = self.compile( 2025-05-07T20:32:44.9360143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9360322Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9360448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9360453Z 2025-05-07T20:32:44.9360662Z self = 2025-05-07T20:32:44.9361526Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9362030Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556e480>} 2025-05-07T20:32:44.9362790Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9362979Z context = 2025-05-07T20:32:44.9362984Z 2025-05-07T20:32:44.9363154Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9363421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9363540Z module_map=module_map) 2025-05-07T20:32:44.9363702Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9363803Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9363889Z E ^ 2025-05-07T20:32:44.9364246Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9364251Z 2025-05-07T20:32:44.9364665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9364670Z 2025-05-07T20:32:44.9364780Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9365002Z self=, 2025-05-07T20:32:44.9365089Z T=2048, 2025-05-07T20:32:44.9365166Z D=7168, 2025-05-07T20:32:44.9365249Z scale_ub=1200.0, 2025-05-07T20:32:44.9365347Z contiguous=True, 2025-05-07T20:32:44.9365431Z compiled=False, 2025-05-07T20:32:44.9365505Z ) 2025-05-07T20:32:44.9365736Z self = 2025-05-07T20:32:44.9365915Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9365920Z 2025-05-07T20:32:44.9365997Z @given( 2025-05-07T20:32:44.9366121Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9366219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9366341Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9366459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9366592Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9366679Z ) 2025-05-07T20:32:44.9366949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9367044Z def test_silu_mul_quant( 2025-05-07T20:32:44.9367132Z self, 2025-05-07T20:32:44.9367256Z T: int, 2025-05-07T20:32:44.9367333Z D: int, 2025-05-07T20:32:44.9367438Z scale_ub: Optional[float], 2025-05-07T20:32:44.9367530Z contiguous: bool, 2025-05-07T20:32:44.9367655Z compiled: bool, 2025-05-07T20:32:44.9367743Z ) -> None: 2025-05-07T20:32:44.9367839Z torch.manual_seed(2025) 2025-05-07T20:32:44.9367924Z 2025-05-07T20:32:44.9368094Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9369890Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9369946Z 2025-05-07T20:32:44.9370065Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9370107Z 2025-05-07T20:32:44.9370215Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9370444Z self=, 2025-05-07T20:32:44.9370522Z T=1, 2025-05-07T20:32:44.9370599Z D=5120, 2025-05-07T20:32:44.9370690Z scale_ub=1200.0, 2025-05-07T20:32:44.9370779Z contiguous=True, 2025-05-07T20:32:44.9370866Z compiled=False, 2025-05-07T20:32:44.9370947Z ) 2025-05-07T20:32:44.9371165Z self = 2025-05-07T20:32:44.9371340Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9371344Z 2025-05-07T20:32:44.9371421Z @given( 2025-05-07T20:32:44.9371537Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9371648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9371763Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9371883Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9372005Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9372081Z ) 2025-05-07T20:32:44.9372328Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9372429Z def test_silu_mul_quant( 2025-05-07T20:32:44.9372506Z self, 2025-05-07T20:32:44.9372592Z T: int, 2025-05-07T20:32:44.9372669Z D: int, 2025-05-07T20:32:44.9372767Z scale_ub: Optional[float], 2025-05-07T20:32:44.9372863Z contiguous: bool, 2025-05-07T20:32:44.9372950Z compiled: bool, 2025-05-07T20:32:44.9373029Z ) -> None: 2025-05-07T20:32:44.9373130Z torch.manual_seed(2025) 2025-05-07T20:32:44.9373203Z 2025-05-07T20:32:44.9373373Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9373456Z 2025-05-07T20:32:44.9373549Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9373675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9373774Z x = x_sign * x_clamp 2025-05-07T20:32:44.9373855Z x0 = x[:, :D] 2025-05-07T20:32:44.9373944Z x1 = x[:, D:] 2025-05-07T20:32:44.9374016Z 2025-05-07T20:32:44.9374101Z if contiguous: 2025-05-07T20:32:44.9374200Z x0 = x0.contiguous() 2025-05-07T20:32:44.9374293Z x1 = x1.contiguous() 2025-05-07T20:32:44.9374365Z 2025-05-07T20:32:44.9374464Z if scale_ub is not None: 2025-05-07T20:32:44.9374570Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9374705Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9374790Z ) 2025-05-07T20:32:44.9374866Z else: 2025-05-07T20:32:44.9374979Z scale_ub_tensor = None 2025-05-07T20:32:44.9375054Z 2025-05-07T20:32:44.9375234Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9375332Z op = silu_mul_quant 2025-05-07T20:32:44.9375419Z if compiled: 2025-05-07T20:32:44.9375567Z op = torch.compile(op) 2025-05-07T20:32:44.9375684Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9375758Z 2025-05-07T20:32:44.9375857Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9375861Z 2025-05-07T20:32:44.9375959Z moe/activation_test.py:117: 2025-05-07T20:32:44.9376087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9376200Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9376299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9376798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9376901Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9377352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9377616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9377961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9378055Z kernel = self.compile( 2025-05-07T20:32:44.9378445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9378618Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9378753Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9378757Z 2025-05-07T20:32:44.9378963Z self = 2025-05-07T20:32:44.9379742Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9380259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556fa60>} 2025-05-07T20:32:44.9381014Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9381211Z context = 2025-05-07T20:32:44.9381215Z 2025-05-07T20:32:44.9381379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9381641Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9381763Z module_map=module_map) 2025-05-07T20:32:44.9381924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9382029Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9382109Z E ^ 2025-05-07T20:32:44.9382470Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9382474Z 2025-05-07T20:32:44.9382897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9382902Z 2025-05-07T20:32:44.9383005Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9383235Z self=, 2025-05-07T20:32:44.9383314Z T=2048, 2025-05-07T20:32:44.9383389Z D=5120, 2025-05-07T20:32:44.9383476Z scale_ub=None, 2025-05-07T20:32:44.9383562Z contiguous=True, 2025-05-07T20:32:44.9383646Z compiled=False, 2025-05-07T20:32:44.9383727Z ) 2025-05-07T20:32:44.9383990Z self = 2025-05-07T20:32:44.9384161Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9384168Z 2025-05-07T20:32:44.9384294Z @given( 2025-05-07T20:32:44.9384413Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9384512Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9384633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9384750Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9384870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9384945Z ) 2025-05-07T20:32:44.9385190Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9385290Z def test_silu_mul_quant( 2025-05-07T20:32:44.9385367Z self, 2025-05-07T20:32:44.9385445Z T: int, 2025-05-07T20:32:44.9385527Z D: int, 2025-05-07T20:32:44.9385665Z scale_ub: Optional[float], 2025-05-07T20:32:44.9385758Z contiguous: bool, 2025-05-07T20:32:44.9385850Z compiled: bool, 2025-05-07T20:32:44.9385928Z ) -> None: 2025-05-07T20:32:44.9386060Z torch.manual_seed(2025) 2025-05-07T20:32:44.9386145Z 2025-05-07T20:32:44.9386312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9386394Z 2025-05-07T20:32:44.9386489Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.9388279Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9388297Z 2025-05-07T20:32:44.9388415Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.9388420Z 2025-05-07T20:32:44.9388527Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9388756Z self=, 2025-05-07T20:32:44.9388833Z T=16384, 2025-05-07T20:32:44.9388911Z D=5120, 2025-05-07T20:32:44.9389003Z scale_ub=None, 2025-05-07T20:32:44.9389212Z contiguous=True, 2025-05-07T20:32:44.9389297Z compiled=False, 2025-05-07T20:32:44.9389378Z ) 2025-05-07T20:32:44.9389597Z self = 2025-05-07T20:32:44.9389778Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9389782Z 2025-05-07T20:32:44.9389858Z @given( 2025-05-07T20:32:44.9389976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9390086Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9390202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9390318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9390444Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9390518Z ) 2025-05-07T20:32:44.9390763Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9390863Z def test_silu_mul_quant( 2025-05-07T20:32:44.9390943Z self, 2025-05-07T20:32:44.9391024Z T: int, 2025-05-07T20:32:44.9391101Z D: int, 2025-05-07T20:32:44.9391200Z scale_ub: Optional[float], 2025-05-07T20:32:44.9391296Z contiguous: bool, 2025-05-07T20:32:44.9391383Z compiled: bool, 2025-05-07T20:32:44.9391462Z ) -> None: 2025-05-07T20:32:44.9391563Z torch.manual_seed(2025) 2025-05-07T20:32:44.9391640Z 2025-05-07T20:32:44.9391807Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9393718Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9393725Z 2025-05-07T20:32:44.9393843Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9393848Z 2025-05-07T20:32:44.9393955Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9394176Z self=, 2025-05-07T20:32:44.9394260Z T=4096, 2025-05-07T20:32:44.9394376Z D=5120, 2025-05-07T20:32:44.9394458Z scale_ub=None, 2025-05-07T20:32:44.9394555Z contiguous=True, 2025-05-07T20:32:44.9394640Z compiled=False, 2025-05-07T20:32:44.9394714Z ) 2025-05-07T20:32:44.9394983Z self = 2025-05-07T20:32:44.9395154Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9395158Z 2025-05-07T20:32:44.9395240Z @given( 2025-05-07T20:32:44.9395361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9395459Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9395582Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9395698Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9395811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9395893Z ) 2025-05-07T20:32:44.9396137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9396235Z def test_silu_mul_quant( 2025-05-07T20:32:44.9396325Z self, 2025-05-07T20:32:44.9396403Z T: int, 2025-05-07T20:32:44.9396481Z D: int, 2025-05-07T20:32:44.9396586Z scale_ub: Optional[float], 2025-05-07T20:32:44.9396681Z contiguous: bool, 2025-05-07T20:32:44.9396767Z compiled: bool, 2025-05-07T20:32:44.9396855Z ) -> None: 2025-05-07T20:32:44.9396955Z torch.manual_seed(2025) 2025-05-07T20:32:44.9397040Z 2025-05-07T20:32:44.9397234Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9399025Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9399043Z 2025-05-07T20:32:44.9399163Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9399170Z 2025-05-07T20:32:44.9399272Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9399498Z self=, 2025-05-07T20:32:44.9399575Z T=2048, 2025-05-07T20:32:44.9399651Z D=5120, 2025-05-07T20:32:44.9399739Z scale_ub=None, 2025-05-07T20:32:44.9399829Z contiguous=False, 2025-05-07T20:32:44.9399914Z compiled=False, 2025-05-07T20:32:44.9399993Z ) 2025-05-07T20:32:44.9400208Z self = 2025-05-07T20:32:44.9400384Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9400389Z 2025-05-07T20:32:44.9400465Z @given( 2025-05-07T20:32:44.9400584Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9400736Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9400850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9401007Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9401125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9401200Z ) 2025-05-07T20:32:44.9401444Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9401545Z def test_silu_mul_quant( 2025-05-07T20:32:44.9401621Z self, 2025-05-07T20:32:44.9401705Z T: int, 2025-05-07T20:32:44.9401781Z D: int, 2025-05-07T20:32:44.9401879Z scale_ub: Optional[float], 2025-05-07T20:32:44.9401974Z contiguous: bool, 2025-05-07T20:32:44.9402059Z compiled: bool, 2025-05-07T20:32:44.9402138Z ) -> None: 2025-05-07T20:32:44.9402239Z torch.manual_seed(2025) 2025-05-07T20:32:44.9402354Z 2025-05-07T20:32:44.9402523Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9404339Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9404346Z 2025-05-07T20:32:44.9404465Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9404470Z 2025-05-07T20:32:44.9404579Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9404801Z self=, 2025-05-07T20:32:44.9404887Z T=4096, 2025-05-07T20:32:44.9404968Z D=7168, 2025-05-07T20:32:44.9405051Z scale_ub=None, 2025-05-07T20:32:44.9405141Z contiguous=True, 2025-05-07T20:32:44.9405224Z compiled=True, 2025-05-07T20:32:44.9405300Z ) 2025-05-07T20:32:44.9405526Z self = 2025-05-07T20:32:44.9405692Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.9405697Z 2025-05-07T20:32:44.9405775Z @given( 2025-05-07T20:32:44.9405896Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9405996Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9406115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9406235Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9406351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9406430Z ) 2025-05-07T20:32:44.9406674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9406773Z def test_silu_mul_quant( 2025-05-07T20:32:44.9406856Z self, 2025-05-07T20:32:44.9406933Z T: int, 2025-05-07T20:32:44.9407013Z D: int, 2025-05-07T20:32:44.9407119Z scale_ub: Optional[float], 2025-05-07T20:32:44.9407208Z contiguous: bool, 2025-05-07T20:32:44.9407297Z compiled: bool, 2025-05-07T20:32:44.9407384Z ) -> None: 2025-05-07T20:32:44.9407498Z torch.manual_seed(2025) 2025-05-07T20:32:44.9407584Z 2025-05-07T20:32:44.9407774Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9409553Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9409616Z 2025-05-07T20:32:44.9409772Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9409777Z 2025-05-07T20:32:44.9409881Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9410109Z self=, 2025-05-07T20:32:44.9410188Z T=2048, 2025-05-07T20:32:44.9410265Z D=5120, 2025-05-07T20:32:44.9410355Z scale_ub=1200.0, 2025-05-07T20:32:44.9410442Z contiguous=False, 2025-05-07T20:32:44.9410529Z compiled=False, 2025-05-07T20:32:44.9410611Z ) 2025-05-07T20:32:44.9410828Z self = 2025-05-07T20:32:44.9411008Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9411053Z 2025-05-07T20:32:44.9411133Z @given( 2025-05-07T20:32:44.9411250Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9411354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9411511Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9411627Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9411750Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9411826Z ) 2025-05-07T20:32:44.9412076Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9412171Z def test_silu_mul_quant( 2025-05-07T20:32:44.9412249Z self, 2025-05-07T20:32:44.9412337Z T: int, 2025-05-07T20:32:44.9412413Z D: int, 2025-05-07T20:32:44.9412513Z scale_ub: Optional[float], 2025-05-07T20:32:44.9412608Z contiguous: bool, 2025-05-07T20:32:44.9412695Z compiled: bool, 2025-05-07T20:32:44.9412773Z ) -> None: 2025-05-07T20:32:44.9412876Z torch.manual_seed(2025) 2025-05-07T20:32:44.9412953Z 2025-05-07T20:32:44.9413119Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9414899Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9414904Z 2025-05-07T20:32:44.9415021Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9415025Z 2025-05-07T20:32:44.9415133Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9415355Z self=, 2025-05-07T20:32:44.9415438Z T=4096, 2025-05-07T20:32:44.9415514Z D=7168, 2025-05-07T20:32:44.9415597Z scale_ub=1200.0, 2025-05-07T20:32:44.9415690Z contiguous=True, 2025-05-07T20:32:44.9415778Z compiled=False, 2025-05-07T20:32:44.9415852Z ) 2025-05-07T20:32:44.9416077Z self = 2025-05-07T20:32:44.9416249Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9416254Z 2025-05-07T20:32:44.9416330Z @given( 2025-05-07T20:32:44.9416451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9416549Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9416667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9416782Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9416895Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9416981Z ) 2025-05-07T20:32:44.9417273Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9417367Z def test_silu_mul_quant( 2025-05-07T20:32:44.9417451Z self, 2025-05-07T20:32:44.9417571Z T: int, 2025-05-07T20:32:44.9417649Z D: int, 2025-05-07T20:32:44.9417753Z scale_ub: Optional[float], 2025-05-07T20:32:44.9417844Z contiguous: bool, 2025-05-07T20:32:44.9417929Z compiled: bool, 2025-05-07T20:32:44.9418013Z ) -> None: 2025-05-07T20:32:44.9418108Z torch.manual_seed(2025) 2025-05-07T20:32:44.9418187Z 2025-05-07T20:32:44.9418352Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9420175Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9420251Z 2025-05-07T20:32:44.9420368Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9420374Z 2025-05-07T20:32:44.9420476Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9420703Z self=, 2025-05-07T20:32:44.9420780Z T=16384, 2025-05-07T20:32:44.9420858Z D=7168, 2025-05-07T20:32:44.9420948Z scale_ub=None, 2025-05-07T20:32:44.9421034Z contiguous=False, 2025-05-07T20:32:44.9421117Z compiled=True, 2025-05-07T20:32:44.9421196Z ) 2025-05-07T20:32:44.9421411Z self = 2025-05-07T20:32:44.9421593Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9421600Z 2025-05-07T20:32:44.9421677Z @given( 2025-05-07T20:32:44.9421791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9421899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9422012Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9422128Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9422247Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9422321Z ) 2025-05-07T20:32:44.9422575Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9422673Z def test_silu_mul_quant( 2025-05-07T20:32:44.9422749Z self, 2025-05-07T20:32:44.9422831Z T: int, 2025-05-07T20:32:44.9422908Z D: int, 2025-05-07T20:32:44.9423007Z scale_ub: Optional[float], 2025-05-07T20:32:44.9423102Z contiguous: bool, 2025-05-07T20:32:44.9423192Z compiled: bool, 2025-05-07T20:32:44.9423273Z ) -> None: 2025-05-07T20:32:44.9423374Z torch.manual_seed(2025) 2025-05-07T20:32:44.9423448Z 2025-05-07T20:32:44.9423618Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9425411Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9425417Z 2025-05-07T20:32:44.9425534Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9425539Z 2025-05-07T20:32:44.9425650Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9425918Z self=, 2025-05-07T20:32:44.9426001Z T=4096, 2025-05-07T20:32:44.9426079Z D=7168, 2025-05-07T20:32:44.9426201Z scale_ub=None, 2025-05-07T20:32:44.9426299Z contiguous=True, 2025-05-07T20:32:44.9426382Z compiled=False, 2025-05-07T20:32:44.9426454Z ) 2025-05-07T20:32:44.9426682Z self = 2025-05-07T20:32:44.9426850Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9426855Z 2025-05-07T20:32:44.9426933Z @given( 2025-05-07T20:32:44.9427060Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9427178Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9427315Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9427438Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9427590Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9427673Z ) 2025-05-07T20:32:44.9427917Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9428048Z def test_silu_mul_quant( 2025-05-07T20:32:44.9428499Z self, 2025-05-07T20:32:44.9428617Z T: int, 2025-05-07T20:32:44.9428720Z D: int, 2025-05-07T20:32:44.9428830Z scale_ub: Optional[float], 2025-05-07T20:32:44.9428921Z contiguous: bool, 2025-05-07T20:32:44.9429010Z compiled: bool, 2025-05-07T20:32:44.9429143Z ) -> None: 2025-05-07T20:32:44.9429241Z torch.manual_seed(2025) 2025-05-07T20:32:44.9429318Z 2025-05-07T20:32:44.9429485Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9431266Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9431282Z 2025-05-07T20:32:44.9431397Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9431401Z 2025-05-07T20:32:44.9431507Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9431736Z self=, 2025-05-07T20:32:44.9431811Z T=16384, 2025-05-07T20:32:44.9431887Z D=7168, 2025-05-07T20:32:44.9431972Z scale_ub=None, 2025-05-07T20:32:44.9432055Z contiguous=True, 2025-05-07T20:32:44.9432137Z compiled=False, 2025-05-07T20:32:44.9432215Z ) 2025-05-07T20:32:44.9432431Z self = 2025-05-07T20:32:44.9432615Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9432619Z 2025-05-07T20:32:44.9432697Z @given( 2025-05-07T20:32:44.9432815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9432917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9433029Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9433144Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9433261Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9433334Z ) 2025-05-07T20:32:44.9433585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9433678Z def test_silu_mul_quant( 2025-05-07T20:32:44.9433754Z self, 2025-05-07T20:32:44.9433835Z T: int, 2025-05-07T20:32:44.9433910Z D: int, 2025-05-07T20:32:44.9434009Z scale_ub: Optional[float], 2025-05-07T20:32:44.9434106Z contiguous: bool, 2025-05-07T20:32:44.9434367Z compiled: bool, 2025-05-07T20:32:44.9434443Z ) -> None: 2025-05-07T20:32:44.9434544Z torch.manual_seed(2025) 2025-05-07T20:32:44.9434617Z 2025-05-07T20:32:44.9434849Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9436632Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9436639Z 2025-05-07T20:32:44.9436754Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9436826Z 2025-05-07T20:32:44.9436936Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9437161Z self=, 2025-05-07T20:32:44.9437311Z T=16384, 2025-05-07T20:32:44.9437392Z D=7168, 2025-05-07T20:32:44.9437486Z scale_ub=1200.0, 2025-05-07T20:32:44.9437585Z contiguous=True, 2025-05-07T20:32:44.9437682Z compiled=False, 2025-05-07T20:32:44.9437763Z ) 2025-05-07T20:32:44.9437985Z self = 2025-05-07T20:32:44.9438159Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9438163Z 2025-05-07T20:32:44.9438239Z @given( 2025-05-07T20:32:44.9438358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9438455Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9438576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9438695Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9438809Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9438891Z ) 2025-05-07T20:32:44.9439143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9439237Z def test_silu_mul_quant( 2025-05-07T20:32:44.9439318Z self, 2025-05-07T20:32:44.9439394Z T: int, 2025-05-07T20:32:44.9439469Z D: int, 2025-05-07T20:32:44.9439571Z scale_ub: Optional[float], 2025-05-07T20:32:44.9439659Z contiguous: bool, 2025-05-07T20:32:44.9439743Z compiled: bool, 2025-05-07T20:32:44.9443822Z ) -> None: 2025-05-07T20:32:44.9443950Z torch.manual_seed(2025) 2025-05-07T20:32:44.9444024Z 2025-05-07T20:32:44.9444199Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9446013Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9446025Z 2025-05-07T20:32:44.9446145Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9446150Z 2025-05-07T20:32:44.9446259Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9446481Z self=, 2025-05-07T20:32:44.9446567Z T=128, 2025-05-07T20:32:44.9446644Z D=5120, 2025-05-07T20:32:44.9446728Z scale_ub=1200.0, 2025-05-07T20:32:44.9446823Z contiguous=False, 2025-05-07T20:32:44.9446909Z compiled=False, 2025-05-07T20:32:44.9446988Z ) 2025-05-07T20:32:44.9447286Z self = 2025-05-07T20:32:44.9447461Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9447465Z 2025-05-07T20:32:44.9447586Z @given( 2025-05-07T20:32:44.9447713Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9447811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9447935Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9448051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9448164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9448250Z ) 2025-05-07T20:32:44.9448496Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9448592Z def test_silu_mul_quant( 2025-05-07T20:32:44.9448679Z self, 2025-05-07T20:32:44.9448755Z T: int, 2025-05-07T20:32:44.9448876Z D: int, 2025-05-07T20:32:44.9448992Z scale_ub: Optional[float], 2025-05-07T20:32:44.9449083Z contiguous: bool, 2025-05-07T20:32:44.9449172Z compiled: bool, 2025-05-07T20:32:44.9449263Z ) -> None: 2025-05-07T20:32:44.9449399Z torch.manual_seed(2025) 2025-05-07T20:32:44.9449484Z 2025-05-07T20:32:44.9449653Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9449728Z 2025-05-07T20:32:44.9449829Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9449957Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9450048Z x = x_sign * x_clamp 2025-05-07T20:32:44.9450139Z x0 = x[:, :D] 2025-05-07T20:32:44.9450220Z x1 = x[:, D:] 2025-05-07T20:32:44.9450292Z 2025-05-07T20:32:44.9450389Z if contiguous: 2025-05-07T20:32:44.9450484Z x0 = x0.contiguous() 2025-05-07T20:32:44.9450579Z x1 = x1.contiguous() 2025-05-07T20:32:44.9450658Z 2025-05-07T20:32:44.9450752Z if scale_ub is not None: 2025-05-07T20:32:44.9450867Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9451004Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9451085Z ) 2025-05-07T20:32:44.9451171Z else: 2025-05-07T20:32:44.9451266Z scale_ub_tensor = None 2025-05-07T20:32:44.9451340Z 2025-05-07T20:32:44.9451480Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9451573Z op = silu_mul_quant 2025-05-07T20:32:44.9451662Z if compiled: 2025-05-07T20:32:44.9451773Z op = torch.compile(op) 2025-05-07T20:32:44.9451880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9451954Z 2025-05-07T20:32:44.9452054Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9452058Z 2025-05-07T20:32:44.9452157Z moe/activation_test.py:117: 2025-05-07T20:32:44.9452299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9452405Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9452507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9453029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9453127Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9453489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9453720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9454063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9454167Z kernel = self.compile( 2025-05-07T20:32:44.9454552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9454727Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9454918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9454922Z 2025-05-07T20:32:44.9455127Z self = 2025-05-07T20:32:44.9455965Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9456471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96553b6660>} 2025-05-07T20:32:44.9457221Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9457491Z context = 2025-05-07T20:32:44.9457500Z 2025-05-07T20:32:44.9457691Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9458023Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9458136Z module_map=module_map) 2025-05-07T20:32:44.9458297Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9458403Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9458481Z E ^ 2025-05-07T20:32:44.9458849Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9458854Z 2025-05-07T20:32:44.9459271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9459275Z 2025-05-07T20:32:44.9459379Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9459619Z self=, 2025-05-07T20:32:44.9459696Z T=2048, 2025-05-07T20:32:44.9459776Z D=7168, 2025-05-07T20:32:44.9459869Z scale_ub=None, 2025-05-07T20:32:44.9459962Z contiguous=False, 2025-05-07T20:32:44.9460056Z compiled=False, 2025-05-07T20:32:44.9460129Z ) 2025-05-07T20:32:44.9460346Z self = 2025-05-07T20:32:44.9460527Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9460532Z 2025-05-07T20:32:44.9460611Z @given( 2025-05-07T20:32:44.9460730Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9460839Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9460955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9461077Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9461198Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9461283Z ) 2025-05-07T20:32:44.9461535Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9461633Z def test_silu_mul_quant( 2025-05-07T20:32:44.9461718Z self, 2025-05-07T20:32:44.9461804Z T: int, 2025-05-07T20:32:44.9461883Z D: int, 2025-05-07T20:32:44.9461982Z scale_ub: Optional[float], 2025-05-07T20:32:44.9462080Z contiguous: bool, 2025-05-07T20:32:44.9462168Z compiled: bool, 2025-05-07T20:32:44.9462250Z ) -> None: 2025-05-07T20:32:44.9462352Z torch.manual_seed(2025) 2025-05-07T20:32:44.9462427Z 2025-05-07T20:32:44.9462598Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9464433Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9464479Z 2025-05-07T20:32:44.9464599Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9464612Z 2025-05-07T20:32:44.9464716Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9464940Z self=, 2025-05-07T20:32:44.9465027Z T=128, 2025-05-07T20:32:44.9465105Z D=7168, 2025-05-07T20:32:44.9465194Z scale_ub=1200.0, 2025-05-07T20:32:44.9465293Z contiguous=True, 2025-05-07T20:32:44.9465378Z compiled=True, 2025-05-07T20:32:44.9465453Z ) 2025-05-07T20:32:44.9465681Z self = 2025-05-07T20:32:44.9465896Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9465900Z 2025-05-07T20:32:44.9465980Z @given( 2025-05-07T20:32:44.9466145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9466250Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9466374Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9466491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9466604Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9466685Z ) 2025-05-07T20:32:44.9466931Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9467029Z def test_silu_mul_quant( 2025-05-07T20:32:44.9467119Z self, 2025-05-07T20:32:44.9467196Z T: int, 2025-05-07T20:32:44.9467273Z D: int, 2025-05-07T20:32:44.9467381Z scale_ub: Optional[float], 2025-05-07T20:32:44.9467470Z contiguous: bool, 2025-05-07T20:32:44.9467591Z compiled: bool, 2025-05-07T20:32:44.9467677Z ) -> None: 2025-05-07T20:32:44.9467795Z torch.manual_seed(2025) 2025-05-07T20:32:44.9467877Z 2025-05-07T20:32:44.9468049Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9468124Z 2025-05-07T20:32:44.9468224Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9468351Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9468443Z x = x_sign * x_clamp 2025-05-07T20:32:44.9468534Z x0 = x[:, :D] 2025-05-07T20:32:44.9468617Z x1 = x[:, D:] 2025-05-07T20:32:44.9468690Z 2025-05-07T20:32:44.9468782Z if contiguous: 2025-05-07T20:32:44.9468876Z x0 = x0.contiguous() 2025-05-07T20:32:44.9468968Z x1 = x1.contiguous() 2025-05-07T20:32:44.9469121Z 2025-05-07T20:32:44.9469213Z if scale_ub is not None: 2025-05-07T20:32:44.9469332Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9469476Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9469556Z ) 2025-05-07T20:32:44.9469640Z else: 2025-05-07T20:32:44.9469737Z scale_ub_tensor = None 2025-05-07T20:32:44.9469811Z 2025-05-07T20:32:44.9469950Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9470040Z op = silu_mul_quant 2025-05-07T20:32:44.9470125Z if compiled: 2025-05-07T20:32:44.9470232Z op = torch.compile(op) 2025-05-07T20:32:44.9470337Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9470412Z 2025-05-07T20:32:44.9470509Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9470514Z 2025-05-07T20:32:44.9470610Z moe/activation_test.py:117: 2025-05-07T20:32:44.9470745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9470846Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9470946Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9471453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9471547Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9472085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9472195Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9472553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9472783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9473122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9473216Z kernel = self.compile( 2025-05-07T20:32:44.9473605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9473821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9473958Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9474001Z 2025-05-07T20:32:44.9474208Z self = 2025-05-07T20:32:44.9474992Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9475505Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96553b7c40>} 2025-05-07T20:32:44.9476260Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9476464Z context = 2025-05-07T20:32:44.9476469Z 2025-05-07T20:32:44.9476639Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9476904Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9477019Z module_map=module_map) 2025-05-07T20:32:44.9477182Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9477288Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9477366Z E ^ 2025-05-07T20:32:44.9477729Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9477734Z 2025-05-07T20:32:44.9478162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9478170Z 2025-05-07T20:32:44.9478281Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9478513Z self=, 2025-05-07T20:32:44.9478591Z T=128, 2025-05-07T20:32:44.9478671Z D=7168, 2025-05-07T20:32:44.9478766Z scale_ub=1200.0, 2025-05-07T20:32:44.9478853Z contiguous=True, 2025-05-07T20:32:44.9478939Z compiled=False, 2025-05-07T20:32:44.9479019Z ) 2025-05-07T20:32:44.9479237Z self = 2025-05-07T20:32:44.9479408Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9479412Z 2025-05-07T20:32:44.9479499Z @given( 2025-05-07T20:32:44.9479618Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9479718Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9479842Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9479960Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9480127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9480201Z ) 2025-05-07T20:32:44.9480453Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9480596Z def test_silu_mul_quant( 2025-05-07T20:32:44.9480674Z self, 2025-05-07T20:32:44.9480756Z T: int, 2025-05-07T20:32:44.9480845Z D: int, 2025-05-07T20:32:44.9480945Z scale_ub: Optional[float], 2025-05-07T20:32:44.9481034Z contiguous: bool, 2025-05-07T20:32:44.9481130Z compiled: bool, 2025-05-07T20:32:44.9481209Z ) -> None: 2025-05-07T20:32:44.9481315Z torch.manual_seed(2025) 2025-05-07T20:32:44.9481392Z 2025-05-07T20:32:44.9481560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9481643Z 2025-05-07T20:32:44.9481738Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9481863Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9483740Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9483747Z 2025-05-07T20:32:44.9483866Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9483871Z 2025-05-07T20:32:44.9483982Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9484203Z self=, 2025-05-07T20:32:44.9484281Z T=128, 2025-05-07T20:32:44.9484364Z D=5120, 2025-05-07T20:32:44.9484448Z scale_ub=1200.0, 2025-05-07T20:32:44.9484544Z contiguous=True, 2025-05-07T20:32:44.9484629Z compiled=True, 2025-05-07T20:32:44.9484703Z ) 2025-05-07T20:32:44.9484929Z self = 2025-05-07T20:32:44.9485098Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9485104Z 2025-05-07T20:32:44.9485187Z @given( 2025-05-07T20:32:44.9485303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9485401Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9485524Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9485639Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9485751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9485835Z ) 2025-05-07T20:32:44.9486079Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9486174Z def test_silu_mul_quant( 2025-05-07T20:32:44.9486264Z self, 2025-05-07T20:32:44.9486341Z T: int, 2025-05-07T20:32:44.9486424Z D: int, 2025-05-07T20:32:44.9486521Z scale_ub: Optional[float], 2025-05-07T20:32:44.9486613Z contiguous: bool, 2025-05-07T20:32:44.9486707Z compiled: bool, 2025-05-07T20:32:44.9486786Z ) -> None: 2025-05-07T20:32:44.9486882Z torch.manual_seed(2025) 2025-05-07T20:32:44.9486966Z 2025-05-07T20:32:44.9487138Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9487214Z 2025-05-07T20:32:44.9487334Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9487473Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9489311Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9489379Z 2025-05-07T20:32:44.9489497Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9489501Z 2025-05-07T20:32:44.9489610Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9489831Z self=, 2025-05-07T20:32:44.9489911Z T=128, 2025-05-07T20:32:44.9489993Z D=7168, 2025-05-07T20:32:44.9490077Z scale_ub=None, 2025-05-07T20:32:44.9490164Z contiguous=True, 2025-05-07T20:32:44.9490253Z compiled=True, 2025-05-07T20:32:44.9490330Z ) 2025-05-07T20:32:44.9490548Z self = 2025-05-07T20:32:44.9490719Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.9490767Z 2025-05-07T20:32:44.9490845Z @given( 2025-05-07T20:32:44.9490960Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9491104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9491221Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9491344Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9491457Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9491530Z ) 2025-05-07T20:32:44.9491779Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9491872Z def test_silu_mul_quant( 2025-05-07T20:32:44.9491948Z self, 2025-05-07T20:32:44.9492031Z T: int, 2025-05-07T20:32:44.9492107Z D: int, 2025-05-07T20:32:44.9492203Z scale_ub: Optional[float], 2025-05-07T20:32:44.9492302Z contiguous: bool, 2025-05-07T20:32:44.9492386Z compiled: bool, 2025-05-07T20:32:44.9492477Z ) -> None: 2025-05-07T20:32:44.9492571Z torch.manual_seed(2025) 2025-05-07T20:32:44.9492645Z 2025-05-07T20:32:44.9492818Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9494595Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9494601Z 2025-05-07T20:32:44.9494725Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9494859Z =============================== warnings summary =============================== 2025-05-07T20:32:44.9495173Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.9495483Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.9495782Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.9496674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:44.9496903Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:44.9496908Z 2025-05-07T20:32:44.9497118Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:44.9497293Z ================= 1 failed, 1 deselected, 3 warnings in 13.88s ================= 2025-05-07T20:32:46.5191792Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:46.5810987Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:32:46.5811613Z 2025-05-07T20:32:46.5812080Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:32:46.5813520Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:32:46.5814324Z 2025-05-07T20:32:46.5814332Z 2025-05-07T20:32:46.5814340Z 2025-05-07T20:32:46.5831161Z ##[error]Process completed with exit code 1. 2025-05-07T20:32:46.5911701Z Post job cleanup. 2025-05-07T20:32:46.6891685Z [command]/usr/bin/git version 2025-05-07T20:32:46.6933015Z git version 2.47.1 2025-05-07T20:32:46.6967773Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/b3674192-b0ff-41b0-bb10-935329a809c5/.gitconfig' 2025-05-07T20:32:46.6978458Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/b3674192-b0ff-41b0-bb10-935329a809c5' before making global git config changes 2025-05-07T20:32:46.6979309Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:32:46.6983889Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:32:46.7032664Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:32:46.7067704Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:32:46.7406013Z Entering 'external/asmjit' 2025-05-07T20:32:46.7473222Z Entering 'external/composable_kernel' 2025-05-07T20:32:46.7547295Z Entering 'external/cpuinfo' 2025-05-07T20:32:46.7618285Z Entering 'external/cutlass' 2025-05-07T20:32:46.7694068Z Entering 'external/googletest' 2025-05-07T20:32:46.7761159Z Entering 'external/hipify_torch' 2025-05-07T20:32:46.7827284Z Entering 'external/json' 2025-05-07T20:32:46.7917690Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:32:46.7944352Z http.https://github.com/.extraheader 2025-05-07T20:32:46.7956321Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:32:46.7992039Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:32:46.8326872Z Entering 'external/asmjit' 2025-05-07T20:32:46.8369675Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8412732Z Entering 'external/composable_kernel' 2025-05-07T20:32:46.8456953Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8505770Z Entering 'external/cpuinfo' 2025-05-07T20:32:46.8549129Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8592384Z Entering 'external/cutlass' 2025-05-07T20:32:46.8635477Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8686279Z 
Entering 'external/googletest' 2025-05-07T20:32:46.8737277Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8774204Z Entering 'external/hipify_torch' 2025-05-07T20:32:46.8816608Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8859551Z Entering 'external/json' 2025-05-07T20:32:46.8903069Z http.https://github.com/.extraheader 2025-05-07T20:32:46.9052067Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:32:46.9085424Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:32:46.9096190Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:32:46.9096548Z ##[endgroup] 2025-05-07T20:32:46.9196508Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:32:57.7037607Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:14.0619137Z Cleaning up orphan processes
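
The CompilationError failures above point at an architecture mismatch rather than a kernel bug: this job ran on linux.g5.4xlarge.nvidia.gpu, an NVIDIA A10G at compute capability sm_86, and Triton's fp8e4nv (e4m3) dtype appears to require a newer part (sm_89/Ada or later); on sm_86 Triton offers only fp8e4b15 and fp8e5, exactly as the ValueError states. Below is a minimal sketch of a capability guard that a test like this could use to skip FP8 cases on unsupported GPUs; the (8, 9) threshold and the skipIf wiring are illustrative assumptions, not code from moe/activation_test.py:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) is assumed to need sm_89 or newer; the A10G in
        # this log is sm_86, so the guard returns False there.
        if not torch.cuda.is_available():
            return False
        # get_device_capability() returns (major, minor), e.g. (8, 6).
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class SiluMulQuantFP8Test(unittest.TestCase):
        def test_placeholder(self) -> None:
            pass

With such a guard the job would report a skip instead of failing compilation on every fp8 example.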
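The long run of torch.OutOfMemoryError examples follows one pattern: after the first failure the GPU is left with only ~26 MiB free of 22.07 GiB, so every subsequent Hypothesis example dies at its very first allocation. The "Tried to allocate" sizes are exactly the input tensor x of shape [T, 2*D] in bfloat16, i.e. T * 2D * 2 bytes: 2048 * 10240 * 2 B = 40 MiB, 4096 * 14336 * 2 B = 112 MiB, and 16384 * 14336 * 2 B = 448 MiB, matching the log line for line. Note also that only ~19 MiB is "reserved by PyTorch but unallocated", so the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True suggestion in the message targets fragmentation and would likely not help here; the memory is held by live allocations carried over from earlier examples. A hedged sketch of per-example cleanup that could keep one OOM from cascading — the helper and where to call it are assumptions, not part of the original test:

    import gc

    import torch

    def free_cuda_between_examples() -> None:
        # Drop dangling Python references first, then return the caching
        # allocator's unused blocks to the driver and wait for pending work.
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

    # e.g. called at the top of test_silu_mul_quant so that each Hypothesis
    # example starts from a clean allocator state.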
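Finally, @settings(verbosity=Verbosity.verbose) is what prints each "Trying example" block above, and the harness already re-runs with --lf --last-failed-no-failures none, so the logged parameter sets can be replayed deterministically. A small self-contained sketch of pinning one logged failure with hypothesis.example so it always runs first; the test body here is a stand-in, not the real test:

    from typing import Optional

    from hypothesis import Verbosity, example, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @example(T=2048, D=5120, scale_ub=None)  # first failing case in the log
    @settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
    def test_shapes(T: int, D: int, scale_ub: Optional[float]) -> None:
        # Stand-in assertion; the real test builds x = [T, 2 * D] in bf16.
        assert T * 2 * D > 0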