2025-05-07T20:23:13.3944892Z Current runner version: '2.323.0'
2025-05-07T20:23:13.3950413Z Runner name: 'i-0b68a33264ad7b273'
2025-05-07T20:23:13.3951395Z Machine name: 'ip-10-0-14-174'
2025-05-07T20:23:13.3954176Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:13.3956484Z Contents: read
2025-05-07T20:23:13.3957026Z Metadata: read
2025-05-07T20:23:13.3957534Z Packages: read
2025-05-07T20:23:13.3958042Z ##[endgroup]
2025-05-07T20:23:13.3959988Z Secret source: None
2025-05-07T20:23:13.3960643Z Prepare workflow directory
2025-05-07T20:23:13.4884742Z Prepare all required actions
2025-05-07T20:23:13.4924176Z Getting action download info
2025-05-07T20:23:13.7038331Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:13.9891981Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:14.3999816Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:16.0056498Z Getting action download info
2025-05-07T20:23:16.1091791Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:16.3112390Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.13, 12.6.3, 12.6.3, gcc)
2025-05-07T20:23:16.3728880Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:16.3862526Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:16.3875121Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:16.3876921Z ##[endgroup]
2025-05-07T20:23:17.4974099Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:17.4974539Z Instance Type: g5.4xlarge
2025-05-07T20:23:17.4974784Z AMI Name: unknown
2025-05-07T20:23:17.5012706Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:22.9017110Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:22.9017416Z with:
2025-05-07T20:23:22.9017662Z   submodules: true
2025-05-07T20:23:22.9017914Z   repository: pytorch/FBGEMM
2025-05-07T20:23:22.9018298Z   token: ***
2025-05-07T20:23:22.9018505Z   ssh-strict: true
2025-05-07T20:23:22.9018712Z   ssh-user: git
2025-05-07T20:23:22.9018937Z   persist-credentials: true
2025-05-07T20:23:22.9019180Z   clean: true
2025-05-07T20:23:22.9019410Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:22.9019676Z   fetch-depth: 1
2025-05-07T20:23:22.9019891Z   fetch-tags: false
2025-05-07T20:23:22.9020110Z   show-progress: true
2025-05-07T20:23:22.9020325Z   lfs: false
2025-05-07T20:23:22.9020538Z   set-safe-directory: true
2025-05-07T20:23:22.9020786Z env:
2025-05-07T20:23:22.9021001Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:22.9021295Z   BUILD_ENV: build_binary
2025-05-07T20:23:22.9021553Z   BUILD_TARGET: genai
2025-05-07T20:23:22.9021770Z   BUILD_VARIANT: cuda
2025-05-07T20:23:22.9022027Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:22.9022275Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:22.9022508Z ##[endgroup]
2025-05-07T20:23:23.0189936Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:23.0191253Z ##[group]Getting Git version info
2025-05-07T20:23:23.0191824Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:23.0192606Z [command]/usr/bin/git version
2025-05-07T20:23:23.0192943Z git version 2.47.1
2025-05-07T20:23:23.0208952Z ##[endgroup]
2025-05-07T20:23:23.0219362Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/d4e0d646-646d-4635-b7a6-4aac06c6045d/.gitconfig'
2025-05-07T20:23:23.0229496Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/d4e0d646-646d-4635-b7a6-4aac06c6045d' before making global git config changes
2025-05-07T20:23:23.0230472Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:23.0243041Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:23.0289710Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:23.0315300Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:23.0333825Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:23.0337744Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:23.0363689Z refs/heads/main
2025-05-07T20:23:23.0374173Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:23.9071531Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:23.9123905Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:23.9155296Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:23.9160483Z ##[endgroup]
2025-05-07T20:23:23.9163996Z [command]/usr/bin/git submodule status
2025-05-07T20:23:23.9586910Z  e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:23.9671615Z  4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:23.9759703Z  6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:23.9848541Z  3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:23.9934064Z  f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:24.0019285Z  420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:24.0102492Z  9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:24.0116999Z ##[group]Cleaning the repository
2025-05-07T20:23:24.0121932Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:24.0180398Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:24.0294255Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:24.0302071Z ##[endgroup]
2025-05-07T20:23:24.0304281Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:24.0308683Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:24.0341684Z ##[endgroup]
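The cleanup above follows checkout's usual pattern: detach HEAD first, then force-delete the stale local branch so the upcoming fetch cannot collide with it, and pin gc.auto to 0 so git never garbage-collects mid-job. A minimal sketch of the same sequence, assuming a repository that still has a local main branch:

    # Detach so the branch can be deleted even if it is currently checked out
    git checkout --detach
    # Remove the stale ref; the next fetch will recreate whatever ref it needs
    git branch --delete --force main
    # Keep git from running garbage collection during the job
    git config --local gc.auto 0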
2025-05-07T20:23:24.0342142Z ##[group]Setting up auth
2025-05-07T20:23:24.0347822Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:24.0390668Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:24.0724863Z Entering 'external/asmjit'
2025-05-07T20:23:24.0791039Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.0863547Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.0932115Z Entering 'external/cutlass'
2025-05-07T20:23:24.1005841Z Entering 'external/googletest'
2025-05-07T20:23:24.1071075Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.1138490Z Entering 'external/json'
2025-05-07T20:23:24.1224844Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:24.1258953Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:24.1591484Z Entering 'external/asmjit'
2025-05-07T20:23:24.1658187Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.1730866Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.1797551Z Entering 'external/cutlass'
2025-05-07T20:23:24.1874404Z Entering 'external/googletest'
2025-05-07T20:23:24.1941202Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.2006939Z Entering 'external/json'
2025-05-07T20:23:24.2093185Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:24.2147773Z ##[endgroup]
2025-05-07T20:23:24.2148785Z ##[group]Fetching the repository
2025-05-07T20:23:24.2156288Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:24.4546439Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:24.4547239Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:24.4573863Z ##[endgroup]
2025-05-07T20:23:24.4574628Z ##[group]Determining the checkout info
2025-05-07T20:23:24.4576694Z ##[endgroup]
2025-05-07T20:23:24.4581700Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:24.4634927Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:24.4665421Z ##[group]Checking out the ref
2025-05-07T20:23:24.4669659Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:24.4799448Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:24.4802513Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:24.4812624Z ##[endgroup]
2025-05-07T20:23:24.4813151Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:24.4818334Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:24.4868699Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:24.4899477Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:24.4930425Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:24.4959150Z ##[endgroup]
2025-05-07T20:23:24.4959782Z ##[group]Fetching submodules
2025-05-07T20:23:24.4962566Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:24.5337647Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:24.5338775Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:24.5340132Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:24.5340907Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:24.5341662Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:24.5342189Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:24.5342643Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:24.5354337Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:24.5779383Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:24.5927379Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:24.6028022Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:24.6195117Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:24.6283584Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:24.6365231Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:24.6469864Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:24.6488253Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:24.6825452Z Entering 'external/asmjit'
2025-05-07T20:23:24.6858113Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.6890122Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.6923236Z Entering 'external/cutlass'
2025-05-07T20:23:24.6955471Z Entering 'external/googletest'
2025-05-07T20:23:24.6987642Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.7020254Z Entering 'external/json'
2025-05-07T20:23:24.7065070Z ##[endgroup]
2025-05-07T20:23:24.7065579Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:24.7070729Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:24.7401871Z Entering 'external/asmjit'
2025-05-07T20:23:24.7444480Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7444977Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7487394Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.7531398Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7556115Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7583375Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.7627749Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7628074Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7671070Z Entering 'external/cutlass'
2025-05-07T20:23:24.7712907Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7713220Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7764731Z Entering 'external/googletest'
2025-05-07T20:23:24.7808864Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7809202Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7851695Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.7895329Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7895658Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7937788Z Entering 'external/json'
2025-05-07T20:23:24.7979934Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7980264Z url.https://github.com/.insteadof
2025-05-07T20:23:24.8043397Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:24.8377929Z Entering 'external/asmjit'
2025-05-07T20:23:24.8441440Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:24.8444207Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.8507835Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:24.8509080Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.8570749Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:24.8574223Z Entering 'external/cutlass'
2025-05-07T20:23:24.8635399Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:24.8638751Z Entering 'external/googletest'
2025-05-07T20:23:24.8700647Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:24.8703937Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.8765927Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:24.8768676Z Entering 'external/json'
2025-05-07T20:23:24.8829784Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:24.8953492Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:24.9289191Z Entering 'external/asmjit'
2025-05-07T20:23:24.9322002Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.9354726Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.9385821Z Entering 'external/cutlass'
2025-05-07T20:23:24.9419599Z Entering 'external/googletest'
2025-05-07T20:23:24.9450625Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.9482496Z Entering 'external/json'
2025-05-07T20:23:24.9530375Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:24.9860722Z Entering 'external/asmjit'
2025-05-07T20:23:24.9893491Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.9925914Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.9956862Z Entering 'external/cutlass'
2025-05-07T20:23:24.9987581Z Entering 'external/googletest'
2025-05-07T20:23:25.0019079Z Entering 'external/hipify_torch'
2025-05-07T20:23:25.0052450Z Entering 'external/json'
2025-05-07T20:23:25.0096003Z ##[endgroup]
2025-05-07T20:23:25.0138800Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:25.0163050Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
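The checkout step ends here with the PR merge commit a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 checked out and all seven submodules pinned. The same state can be reproduced outside the runner with plain git; a sketch, assuming a clean clone and skipping the temporary-HOME and token wiring that the action handles internally:

    git clone --no-checkout https://github.com/pytorch/FBGEMM
    cd FBGEMM
    # Shallow-fetch the PR merge ref, exactly as the action does
    git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 \
        origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
    git checkout --progress --force refs/remotes/pull/4066/merge
    # Pin the submodules to their recorded commits
    git -c protocol.version=2 submodule update --init --force --depth=1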
2025-05-07T20:23:25.0353882Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:23:25.0354190Z with:
2025-05-07T20:23:25.0354423Z   name: fbgemm_genai_x86_gcc_py3.13_cu12.6.3.whl
2025-05-07T20:23:25.0354735Z   merge-multiple: false
2025-05-07T20:23:25.0354978Z   repository: pytorch/FBGEMM
2025-05-07T20:23:25.0355225Z   run-id: 14891846252
2025-05-07T20:23:25.0355426Z env:
2025-05-07T20:23:25.0355642Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:25.0355926Z   BUILD_ENV: build_binary
2025-05-07T20:23:25.0356157Z   BUILD_TARGET: genai
2025-05-07T20:23:25.0356371Z   BUILD_VARIANT: cuda
2025-05-07T20:23:25.0356605Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:25.0356841Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:25.0357072Z ##[endgroup]
2025-05-07T20:23:25.2716231Z Downloading single artifact
2025-05-07T20:23:25.3710390Z Preparing to download the following artifacts:
2025-05-07T20:23:25.3711204Z - fbgemm_genai_x86_gcc_py3.13_cu12.6.3.whl (ID: 3081362642, Size: 12512725, Expected Digest: sha256:228c0da92693d2954cf116c01d25e7cc680533513556b331a58d6b7834b2e3d4)
2025-05-07T20:23:25.4211348Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-d2ebcb72-c99d-5c1c-9db7-78599d6c6d28/artifacts/4c58965a6bbd4d44222979263dfcdea5bd55f581a5885da24be5168ea14aaaab.zip
2025-05-07T20:23:25.4212744Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:25.5048221Z (node:245835) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:25.5049166Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:25.7225302Z SHA256 digest of downloaded artifact is 228c0da92693d2954cf116c01d25e7cc680533513556b331a58d6b7834b2e3d4
2025-05-07T20:23:25.7225886Z Artifact download completed successfully.
2025-05-07T20:23:25.7226223Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:25.7231699Z Download artifact has finished successfully
2025-05-07T20:23:25.7475366Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:25.7475743Z with:
2025-05-07T20:23:25.7475963Z   driver-version: 570.133.07
2025-05-07T20:23:25.7476206Z env:
2025-05-07T20:23:25.7476421Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:25.7476710Z   BUILD_ENV: build_binary
2025-05-07T20:23:25.7476942Z   BUILD_TARGET: genai
2025-05-07T20:23:25.7477175Z   BUILD_VARIANT: cuda
2025-05-07T20:23:25.7477399Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:25.7477648Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:25.7477881Z ##[endgroup]
2025-05-07T20:23:25.7571929Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:25.7572305Z with:
2025-05-07T20:23:25.7572521Z   timeout_minutes: 10
2025-05-07T20:23:25.7572748Z   max_attempts: 3
2025-05-07T20:23:25.7595334Z   command:
    # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists; check its version next. Also check only the first GPU
          # if there is more than one, so that the same driver version is not printed
          # over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in the future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU
            # microcode. When this happens, we'll try to reset all NVIDIA devices
            # https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, the nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed, as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                 Off | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead, as it is guaranteed to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info like the
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?
          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true
    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:23:25.7618445Z   retry_wait_seconds: 10
2025-05-07T20:23:25.7618701Z   polling_interval_seconds: 1
2025-05-07T20:23:25.7618950Z   warning_on_retry: true
2025-05-07T20:23:25.7619188Z   continue_on_error: false
2025-05-07T20:23:25.7619428Z env:
2025-05-07T20:23:25.7619635Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:25.7619930Z   BUILD_ENV: build_binary
2025-05-07T20:23:25.7632680Z   BUILD_TARGET: genai
2025-05-07T20:23:25.7632922Z   BUILD_VARIANT: cuda
2025-05-07T20:23:25.7633159Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:25.7633412Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:25.7633652Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:25.7633884Z ##[endgroup]
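The idempotency of the step above hinges on one check: ask nvidia-smi for the installed driver version, compare it against the pinned DRIVER_VERSION, and treat exit statuses 0 and 14 as acceptable (14 is a known benign status; see the gpu-operator issue linked in the script). A standalone sketch of that check, assuming DRIVER_VERSION is set as in the step env above:

    #!/usr/bin/env bash
    # Sketch: decide whether the pinned NVIDIA driver is already present.
    DRIVER_VERSION="${DRIVER_VERSION:-570.133.07}"
    if command -v nvidia-smi >/dev/null 2>&1; then
      installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
      status=$?
      # 0 and 14 are the allowed nvidia-smi exit statuses
      if { [ "$status" -eq 0 ] || [ "$status" -eq 14 ]; } && [ "$installed" = "$DRIVER_VERSION" ]; then
        echo "NVIDIA driver $installed already installed; skipping"
        exit 0
      fi
    fi
    echo "NVIDIA driver installation required"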
2025-05-07T20:23:26.6536998Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:26.6539237Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:26.6539755Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:26.9476564Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:26.9477584Z No packages marked for removal.
2025-05-07T20:23:26.9541023Z Dependencies resolved.
2025-05-07T20:23:26.9550634Z Nothing to do.
2025-05-07T20:23:26.9550984Z Complete!
2025-05-07T20:23:26.9901181Z + install_nvidia_driver_common
2025-05-07T20:23:26.9905047Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:26.9905472Z + lspci
2025-05-07T20:23:26.9907736Z Before installing NVIDIA driver
2025-05-07T20:23:27.0027083Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:27.0027817Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:27.0028359Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:27.0028983Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:27.0029673Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:27.0030439Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:27.0030926Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:27.0031398Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:27.0031827Z + lsmod
2025-05-07T20:23:27.0074694Z Module                  Size  Used by
2025-05-07T20:23:27.0075300Z xt_nat                 16384  0
2025-05-07T20:23:27.0075812Z nvidia_modeset       1716224  0
2025-05-07T20:23:27.0076355Z video                  65536  1 nvidia_modeset
2025-05-07T20:23:27.0076956Z wmi                    36864  1 video
2025-05-07T20:23:27.0077495Z nvidia_uvm           1884160  0
2025-05-07T20:23:27.0078086Z nvidia              11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:27.0078714Z drm                   602112  1 nvidia
2025-05-07T20:23:27.0079306Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:27.0080032Z backlight              24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:27.0080705Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:27.0081264Z veth                   36864  0
2025-05-07T20:23:27.0081773Z xt_conntrack           16384  1
2025-05-07T20:23:27.0082226Z nft_chain_nat          16384  3
2025-05-07T20:23:27.0082488Z xt_MASQUERADE          20480  1
2025-05-07T20:23:27.0083008Z nf_nat                 57344  3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:27.0083347Z nf_conntrack_netlink    57344  0
2025-05-07T20:23:27.0083772Z nf_conntrack          184320  5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:27.0084233Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:27.0084548Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:27.0084836Z xfrm_user              57344  1
2025-05-07T20:23:27.0085105Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:27.0085397Z xt_addrtype            16384  2
2025-05-07T20:23:27.0085648Z nft_compat             20480  4
2025-05-07T20:23:27.0085951Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:27.0086361Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:27.0086729Z br_netfilter           36864  0
2025-05-07T20:23:27.0087008Z bridge                323584  1 br_netfilter
2025-05-07T20:23:27.0087309Z stp                    16384  1 bridge
2025-05-07T20:23:27.0087594Z llc                    16384  2 bridge,stp
2025-05-07T20:23:27.0087871Z overlay               167936  0
2025-05-07T20:23:27.0088122Z tls                   135168  0
2025-05-07T20:23:27.0088378Z nls_ascii              16384  1
2025-05-07T20:23:27.0088624Z nls_cp437              20480  1
2025-05-07T20:23:27.0088874Z vfat                   24576  1
2025-05-07T20:23:27.0089125Z fat                    86016  1 vfat
2025-05-07T20:23:27.0089387Z sunrpc                696320  1
2025-05-07T20:23:27.0089640Z ena                   180224  0
2025-05-07T20:23:27.0089881Z i8042                  45056  0
2025-05-07T20:23:27.0090130Z serio                  28672  3 i8042
2025-05-07T20:23:27.0090406Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:27.0090666Z button                 24576  0
2025-05-07T20:23:27.0090915Z sch_fq_codel           20480  17
2025-05-07T20:23:27.0091174Z fuse                  163840  1
2025-05-07T20:23:27.0091426Z dm_mod                188416  0
2025-05-07T20:23:27.0091681Z configfs               57344  1
2025-05-07T20:23:27.0091931Z dax                    45056  1 dm_mod
2025-05-07T20:23:27.0092343Z loop                   36864  0
2025-05-07T20:23:27.0092628Z dmi_sysfs              20480  0
2025-05-07T20:23:27.0092985Z crc32_pclmul           16384  0
2025-05-07T20:23:27.0093240Z crc32c_intel           24576  0
2025-05-07T20:23:27.0093493Z efivarfs               24576  1
2025-05-07T20:23:27.0093880Z + modinfo nvidia
2025-05-07T20:23:27.0094250Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:27.0094688Z import_ns: DMA_BUF
2025-05-07T20:23:27.0094939Z alias: char-major-195-*
2025-05-07T20:23:27.0095214Z version: 570.133.07
2025-05-07T20:23:27.0095459Z supported: external
2025-05-07T20:23:27.0095709Z license: Dual MIT/GPL
2025-05-07T20:23:27.0095999Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:27.0096339Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:27.0096662Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:27.0096995Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:27.0097328Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:27.0097667Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:27.0097979Z depends: i2c-core,drm
2025-05-07T20:23:27.0098400Z retpoline: Y
2025-05-07T20:23:27.0098620Z name: nvidia
2025-05-07T20:23:27.0098977Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:27.0099444Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:27.0099878Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:27.0100293Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:27.0100599Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:27.0100890Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:27.0101344Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:27.0101647Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:27.0101944Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:27.0102306Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:27.0102689Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:27.0103018Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:27.0103313Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:27.0103619Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:27.0103978Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:27.0104364Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:27.0104739Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:27.0105149Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.0105546Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:27.0105968Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.0106371Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:27.0106710Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:27.0107073Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:27.0107440Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:27.0107778Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:27.0108089Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:27.0108417Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:27.0108738Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:27.0109039Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:27.0109387Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:27.0109744Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:27.0110087Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:27.0110412Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:27.0110762Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:27.0111095Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:27.0111426Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:27.0111894Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:27.0112189Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:27.0112503Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:27.0112824Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:27.0113138Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:27.0113455Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:27.0113809Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:27.0114160Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:27.0114481Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:27.0114816Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:27.0115155Z parm: rm_firmware_active:charp
2025-05-07T20:23:27.0115451Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:27.0115686Z ++ command -v nvidia-smi
2025-05-07T20:23:27.0115944Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:27.0116200Z + set +e
2025-05-07T20:23:27.0116503Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:27.0344260Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:27.0344563Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:27.0344791Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:27.0344997Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:27.0345256Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:27.0345666Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:27.0346109Z + set -e
2025-05-07T20:23:27.0346299Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:27.0346672Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:27.0347123Z + post_install_nvidia_driver_common
2025-05-07T20:23:27.0350848Z + sudo modprobe nvidia
2025-05-07T20:23:27.1907841Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:27.1908274Z + lspci
2025-05-07T20:23:27.1908572Z After installing NVIDIA driver
2025-05-07T20:23:27.2023451Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:27.2024111Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:27.2024720Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:27.2025426Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:27.2026062Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:27.2026605Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:27.2027078Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:27.2027543Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:27.2027952Z + lsmod
2025-05-07T20:23:27.2057150Z Module                  Size  Used by
2025-05-07T20:23:27.2057590Z xt_nat                 16384  0
2025-05-07T20:23:27.2057931Z nvidia_modeset       1716224  0
2025-05-07T20:23:27.2058320Z video                  65536  1 nvidia_modeset
2025-05-07T20:23:27.2058693Z wmi                    36864  1 video
2025-05-07T20:23:27.2058965Z nvidia_uvm           1884160  0
2025-05-07T20:23:27.2059270Z nvidia              11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:27.2059681Z drm                   602112  1 nvidia
2025-05-07T20:23:27.2059989Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:27.2060352Z backlight              24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:27.2060702Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:27.2060991Z veth                   36864  0
2025-05-07T20:23:27.2061244Z xt_conntrack           16384  1
2025-05-07T20:23:27.2061503Z nft_chain_nat          16384  3
2025-05-07T20:23:27.2061776Z xt_MASQUERADE          20480  1
2025-05-07T20:23:27.2062124Z nf_nat                 57344  3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:27.2062471Z nf_conntrack_netlink    57344  0
2025-05-07T20:23:27.2063477Z nf_conntrack          184320  5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:27.2063942Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:27.2064249Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:27.2064545Z xfrm_user              57344  1
2025-05-07T20:23:27.2064812Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:27.2065093Z xt_addrtype            16384  2
2025-05-07T20:23:27.2065351Z nft_compat             20480  4
2025-05-07T20:23:27.2065652Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:27.2066048Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:27.2066421Z br_netfilter           36864  0
2025-05-07T20:23:27.2066702Z bridge                323584  1 br_netfilter
2025-05-07T20:23:27.2066999Z stp                    16384  1 bridge
2025-05-07T20:23:27.2067276Z llc                    16384  2 bridge,stp
2025-05-07T20:23:27.2067558Z overlay               167936  0
2025-05-07T20:23:27.2067815Z tls                   135168  0
2025-05-07T20:23:27.2068063Z nls_ascii              16384  1
2025-05-07T20:23:27.2068314Z nls_cp437              20480  1
2025-05-07T20:23:27.2068561Z vfat                   24576  1
2025-05-07T20:23:27.2068807Z fat                    86016  1 vfat
2025-05-07T20:23:27.2069074Z sunrpc                696320  1
2025-05-07T20:23:27.2069325Z ena                   180224  0
2025-05-07T20:23:27.2069571Z i8042                  45056  0
2025-05-07T20:23:27.2069822Z serio                  28672  3 i8042
2025-05-07T20:23:27.2070099Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:27.2070353Z button                 24576  0
2025-05-07T20:23:27.2070606Z sch_fq_codel           20480  17
2025-05-07T20:23:27.2071020Z fuse                  163840  1
2025-05-07T20:23:27.2071269Z dm_mod                188416  0
2025-05-07T20:23:27.2071512Z configfs               57344  1
2025-05-07T20:23:27.2071774Z dax                    45056  1 dm_mod
2025-05-07T20:23:27.2072049Z loop                   36864  0
2025-05-07T20:23:27.2072297Z dmi_sysfs              20480  0
2025-05-07T20:23:27.2072550Z crc32_pclmul           16384  0
2025-05-07T20:23:27.2072807Z crc32c_intel           24576  0
2025-05-07T20:23:27.2073053Z efivarfs               24576  1
2025-05-07T20:23:27.2073302Z + modinfo nvidia
2025-05-07T20:23:27.2073925Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:27.2074423Z import_ns: DMA_BUF
2025-05-07T20:23:27.2074684Z alias: char-major-195-*
2025-05-07T20:23:27.2074960Z version: 570.133.07
2025-05-07T20:23:27.2075211Z supported: external
2025-05-07T20:23:27.2075460Z license: Dual MIT/GPL
2025-05-07T20:23:27.2075749Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:27.2076093Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:27.2076406Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:27.2076726Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:27.2077067Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:27.2077398Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:27.2077699Z depends: i2c-core,drm
2025-05-07T20:23:27.2077957Z retpoline: Y
2025-05-07T20:23:27.2078175Z name: nvidia
2025-05-07T20:23:27.2078521Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:27.2078989Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:27.2079424Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:27.2079834Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:27.2080140Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:27.2080443Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:27.2080755Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:27.2081046Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:27.2081346Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:27.2081812Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:27.2082192Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:27.2082521Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:27.2082818Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:27.2083115Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:27.2083474Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:27.2083867Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:27.2084239Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:27.2084649Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.2085052Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:27.2085470Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.2085870Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:27.2086206Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:27.2086571Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:27.2086935Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:27.2087278Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:27.2087600Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:27.2087921Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:27.2088244Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:27.2088552Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:27.2088898Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:27.2089251Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:27.2089581Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:27.2091282Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:27.2091620Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:27.2091961Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:27.2092306Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:27.2092632Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:27.2092920Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:27.2093245Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:27.2093568Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:27.2093986Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:27.2094314Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:27.2094674Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:27.2095015Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:27.2095341Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:27.2095690Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:27.2096033Z parm: rm_firmware_active:charp
2025-05-07T20:23:27.2096318Z + set +e
2025-05-07T20:23:27.2096513Z + nvidia-smi
2025-05-07T20:23:27.2253971Z Wed May  7 20:23:27 2025
2025-05-07T20:23:27.2254761Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:27.2255932Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:27.2256879Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:27.2257843Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:27.2258870Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:27.2259709Z |                                         |                        |               MIG M. |
2025-05-07T20:23:27.2260372Z |=========================================+========================+======================|
2025-05-07T20:23:27.2430004Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:27.2430753Z |  0%   29C    P8             24W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:27.2431140Z |                                         |                        |                  N/A |
2025-05-07T20:23:27.2431534Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:27.2435164Z
2025-05-07T20:23:27.2436086Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:27.2436856Z | Processes:                                                                              |
2025-05-07T20:23:27.2437665Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:27.2438423Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:27.2439059Z |=========================================================================================|
2025-05-07T20:23:27.2440664Z |  No running processes found                                                             |
2025-05-07T20:23:27.2441640Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:27.4793662Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:27.4965001Z NVIDIA A10G
2025-05-07T20:23:27.5008033Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:27.5008370Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:27.5008700Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:27.5009001Z + set -e
2025-05-07T20:23:27.5009218Z INFO: Ignoring allowed status 0
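With the pinned driver confirmed, the step proceeds to the container toolkit and finally exports GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all into GITHUB_ENV (it is visible in the env of the later steps). A sketch of how a subsequent step might consume that flag; the CUDA image tag here is illustrative, not taken from this workflow:

    # Hypothetical follow-up step: run a GPU smoke test inside a container.
    # GPU_FLAG is left unquoted on purpose so it expands into separate docker arguments.
    docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi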
2025-05-07T20:23:27.5066081Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:27.5066641Z + sudo yum install -y yum-utils
2025-05-07T20:23:27.9091326Z Last metadata expiration check: 0:54:00 ago on Wed May  7 19:29:27 2025.
2025-05-07T20:23:27.9340418Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:27.9739076Z Dependencies resolved.
2025-05-07T20:23:27.9922128Z Nothing to do.
2025-05-07T20:23:27.9922461Z Complete!
2025-05-07T20:23:28.0321296Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:28.0321892Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:28.0323513Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:28.2798730Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:28.3385484Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:28.8724774Z nvidia-container-toolkit                         13 kB/s | 833 B   00:00
2025-05-07T20:23:28.8984166Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:28.9389718Z Dependencies resolved.
2025-05-07T20:23:28.9568070Z ================================================================================
2025-05-07T20:23:28.9568585Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:28.9568962Z ================================================================================
2025-05-07T20:23:28.9569271Z Downgrading:
2025-05-07T20:23:28.9569634Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit  1.2 M
2025-05-07T20:23:28.9570206Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit  5.6 M
2025-05-07T20:23:28.9570566Z
2025-05-07T20:23:28.9570661Z Transaction Summary
2025-05-07T20:23:28.9570912Z ================================================================================
2025-05-07T20:23:28.9571216Z Downgrade  2 Packages
2025-05-07T20:23:28.9571364Z
2025-05-07T20:23:28.9571481Z Total download size: 6.8 M
2025-05-07T20:23:28.9572246Z Downloading Packages:
2025-05-07T20:23:29.0426014Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  15 MB/s | 1.2 MB  00:00
2025-05-07T20:23:29.1416959Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  31 MB/s | 5.6 MB  00:00
2025-05-07T20:23:29.1425496Z --------------------------------------------------------------------------------
2025-05-07T20:23:29.1428374Z Total                                            37 MB/s | 6.8 MB  00:00
2025-05-07T20:23:29.1431372Z Running transaction check
2025-05-07T20:23:29.1536494Z Transaction check succeeded.
2025-05-07T20:23:29.1537243Z Running transaction test
2025-05-07T20:23:29.1833965Z Transaction test succeeded.
2025-05-07T20:23:29.1836330Z Running transaction
2025-05-07T20:23:29.7347750Z   Preparing        :                                                        1/1
2025-05-07T20:23:29.8421602Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:29.8459128Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:29.8690538Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:29.8691299Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:29.8803517Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:29.8831517Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:30.0740001Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:30.0741155Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:30.0742216Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:30.0742991Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:30.2079504Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:30.2079504Z ================================================================================
2025-05-07T20:23:30.2080070Z WARNING:
2025-05-07T20:23:30.2080315Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:30.2080572Z
2025-05-07T20:23:30.2080674Z   Available Versions:
2025-05-07T20:23:30.2080821Z
2025-05-07T20:23:30.2080917Z   Version 2023.7.20250331:
2025-05-07T20:23:30.2081228Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:30.2081482Z
2025-05-07T20:23:30.2081606Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:30.2081815Z
2025-05-07T20:23:30.2081906Z     Release notes:
2025-05-07T20:23:30.2082314Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:30.2082696Z
2025-05-07T20:23:30.2082806Z   Version 2023.7.20250414:
2025-05-07T20:23:30.2083137Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:30.2083391Z
2025-05-07T20:23:30.2083513Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:30.2083717Z
2025-05-07T20:23:30.2083804Z     Release notes:
2025-05-07T20:23:30.2084199Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:30.2084555Z
2025-05-07T20:23:30.2084656Z   Version 2023.7.20250428:
2025-05-07T20:23:30.2084954Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:30.2085206Z
2025-05-07T20:23:30.2085321Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:30.2085533Z
2025-05-07T20:23:30.2085617Z     Release notes:
2025-05-07T20:23:30.2086004Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:30.2086355Z
2025-05-07T20:23:30.2086469Z ================================================================================
2025-05-07T20:23:30.2444197Z
2025-05-07T20:23:30.2444348Z
2025-05-07T20:23:30.2444434Z Downgraded:
2025-05-07T20:23:30.2444814Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:30.2445378Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:30.2445717Z
2025-05-07T20:23:30.2445800Z Complete!
2025-05-07T20:23:30.2899990Z + sudo systemctl restart docker
2025-05-07T20:23:33.2674923Z nvidia-persistenced failed to initialize. Check syslog for more details.
2025-05-07T20:23:33.2872266Z Wed May  7 20:23:33 2025
2025-05-07T20:23:33.2873026Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:33.2873701Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:33.2874168Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:33.2874650Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:33.2875161Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:33.2875611Z |                                         |                        |               MIG M. |
2025-05-07T20:23:33.2875941Z |=========================================+========================+======================|
2025-05-07T20:23:33.3008124Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:33.3008563Z |  0%   29C    P8             24W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:33.3008937Z |                                         |                        |                  N/A |
2025-05-07T20:23:33.3009320Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:33.3011591Z
2025-05-07T20:23:33.3011997Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:33.3012817Z | Processes:                                                                              |
2025-05-07T20:23:33.3013243Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:33.3013794Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:33.3014137Z |=========================================================================================|
2025-05-07T20:23:33.3017770Z |  No running processes found                                                             |
2025-05-07T20:23:33.3018237Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:33.8155590Z Command completed after 1 attempt(s).
2025-05-07T20:23:33.8242280Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:33.8242719Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:33.8256347Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:33.8256707Z env:
2025-05-07T20:23:33.8256932Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:33.8257232Z   BUILD_ENV: build_binary
2025-05-07T20:23:33.8257478Z   BUILD_TARGET: genai
2025-05-07T20:23:33.8257711Z   BUILD_VARIANT: cuda
2025-05-07T20:23:33.8257940Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:33.8258195Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:33.8258498Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:33.8258821Z ##[endgroup]
2025-05-07T20:23:34.1626961Z ################################################################################
2025-05-07T20:23:34.1627423Z # Print System Info
2025-05-07T20:23:34.1627711Z #
2025-05-07T20:23:34.1644260Z # [2025-05-07T20:23:34.164Z] + print_system_info
2025-05-07T20:23:34.1644608Z ################################################################################
2025-05-07T20:23:34.1644823Z
2025-05-07T20:23:34.1644937Z ################################################################################
2025-05-07T20:23:34.1645272Z [INFO] Printing environment variables ...
2025-05-07T20:23:34.1645567Z + printenv
2025-05-07T20:23:34.1645723Z
2025-05-07T20:23:34.1655151Z SHELL=/bin/bash
2025-05-07T20:23:34.1655495Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:34.1655910Z BUILD_VARIANT=cuda
2025-05-07T20:23:34.1656424Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_78ba5815-fdf5-45ec-beb7-0271d86c1f0b
2025-05-07T20:23:34.1656985Z GITHUB_ACTION=__run
2025-05-07T20:23:34.1657264Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:34.1657595Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:34.1657841Z RUNNER_NAME=i-0b68a33264ad7b273
2025-05-07T20:23:34.1658115Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:34.1658418Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:34.1658678Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:34.1659033Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:34.1659453Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:34.1659730Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:34.1660013Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:34.1660457Z ***
2025-05-07T20:23:34.1660651Z LOGNAME=ec2-user
2025-05-07T20:23:34.1660879Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:34.1661129Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:34.1661356Z GITHUB_ACTIONS=true
2025-05-07T20:23:34.1661572Z SYSTEMD_EXEC_PID=55528
2025-05-07T20:23:34.1661837Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:34.1662372Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:34.1662873Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:34.1663139Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:34.1663392Z RUNNER_OS=Linux
2025-05-07T20:23:34.1663611Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:34.1663845Z HOME=/home/ec2-user
2025-05-07T20:23:34.1664095Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:34.1664948Z LANG=C.UTF-8
2025-05-07T20:23:34.1665699Z RUNNER_TRACKING_ID=github_85c37a8c-042b-4f5a-98d5-bf97741633f7
2025-05-07T20:23:34.1666043Z RUNNER_ARCH=X64
2025-05-07T20:23:34.1666313Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:34.1666632Z BUILD_TARGET=genai
2025-05-07T20:23:34.1667139Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_78ba5815-fdf5-45ec-beb7-0271d86c1f0b
2025-05-07T20:23:34.1667983Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_78ba5815-fdf5-45ec-beb7-0271d86c1f0b
2025-05-07T20:23:34.1668698Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:34.1669349Z INVOCATION_ID=95a11ac3c71b4f0f87a09cc23f2e742b
2025-05-07T20:23:34.1669665Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:34.1669922Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:34.1670486Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_78ba5815-fdf5-45ec-beb7-0271d86c1f0b
2025-05-07T20:23:34.1671081Z BUILD_ENV=build_binary
2025-05-07T20:23:34.1671305Z GITHUB_ACTOR=q10
2025-05-07T20:23:34.1671518Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:34.1671735Z KERN_NAME_LC=linux
2025-05-07T20:23:34.1671955Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:34.1672251Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:34.1672580Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:34.1672820Z USER=ec2-user
2025-05-07T20:23:34.1673049Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:34.1673316Z SHLVL=1 2025-05-07T20:23:34.1673512Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:34.1673841Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:34.1674296Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:34.1674639Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:34.1674876Z KERN_NAME=Linux 2025-05-07T20:23:34.1675103Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:34.1675498Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:34.1675914Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:34.1676183Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:34.1676414Z JOURNAL_STREAM=8:84460 2025-05-07T20:23:34.1676723Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:34.1677082Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:34.1677380Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:34.1677703Z GITHUB_BASE_REF=main 2025-05-07T20:23:34.1677922Z CI=true 2025-05-07T20:23:34.1678122Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:34.1678402Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:34.1678674Z GITHUB_ACTION_REF= 2025-05-07T20:23:34.1678921Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:34.1679508Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_78ba5815-fdf5-45ec-beb7-0271d86c1f0b 2025-05-07T20:23:34.1680080Z MACHINE_NAME=x86_64 2025-05-07T20:23:34.1680340Z _=/usr/bin/printenv 2025-05-07T20:23:34.1680481Z 2025-05-07T20:23:34.1680598Z ################################################################################ 2025-05-07T20:23:34.1680910Z [INFO] Print ldd version ... 2025-05-07T20:23:34.1681160Z + ldd --version 2025-05-07T20:23:34.1681283Z 2025-05-07T20:23:34.1681373Z ldd (GNU libc) 2.34 2025-05-07T20:23:34.1681626Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:34.1682055Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:34.1682604Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:34.1683031Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:34.1683249Z 2025-05-07T20:23:34.1683366Z ################################################################################ 2025-05-07T20:23:34.1683668Z [INFO] Print CPU info ... 
2025-05-07T20:23:34.1683902Z + nproc 2025-05-07T20:23:34.1684007Z 2025-05-07T20:23:34.1686952Z 16 2025-05-07T20:23:34.1688586Z 2025-05-07T20:23:34.1688753Z + lscpu 2025-05-07T20:23:34.1688860Z 2025-05-07T20:23:34.1761769Z Architecture: x86_64 2025-05-07T20:23:34.1762474Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:34.1763225Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1763876Z Byte Order: Little Endian 2025-05-07T20:23:34.1764223Z CPU(s): 16 2025-05-07T20:23:34.1764520Z On-line CPU(s) list: 0-15 2025-05-07T20:23:34.1764832Z Vendor ID: AuthenticAMD 2025-05-07T20:23:34.1765165Z Model name: AMD EPYC 7R32 2025-05-07T20:23:34.1765474Z CPU family: 23 2025-05-07T20:23:34.1765989Z Model: 49 2025-05-07T20:23:34.1766279Z Thread(s) per core: 2 2025-05-07T20:23:34.1766564Z Core(s) per socket: 8 2025-05-07T20:23:34.1766844Z Socket(s): 1 2025-05-07T20:23:34.1767120Z Stepping: 0 2025-05-07T20:23:34.1767418Z BogoMIPS: 5599.85 2025-05-07T20:23:34.1769431Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1771427Z Hypervisor vendor: KVM 2025-05-07T20:23:34.1771732Z Virtualization type: full 2025-05-07T20:23:34.1772073Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:34.1772431Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:34.1772781Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:34.1773121Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:34.1773438Z NUMA node(s): 1 2025-05-07T20:23:34.1773833Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:34.1774199Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:34.1774570Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:34.1774924Z Vulnerability L1tf: Not affected 2025-05-07T20:23:34.1775261Z Vulnerability Mds: Not affected 2025-05-07T20:23:34.1775614Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:34.1775969Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:34.1776327Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:34.1776857Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:34.1777423Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:34.1777953Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:34.1778614Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:34.1779486Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:34.1780179Z Vulnerability Srbds: Not affected 2025-05-07T20:23:34.1780540Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:34.1780886Z 2025-05-07T20:23:34.1781016Z + cat /proc/cpuinfo 2025-05-07T20:23:34.1781217Z 2025-05-07T20:23:34.1781333Z processor : 0 2025-05-07T20:23:34.1781629Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1782144Z cpu family : 23 2025-05-07T20:23:34.1782415Z model : 49 
2025-05-07T20:23:34.1782698Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1783016Z stepping : 0 2025-05-07T20:23:34.1783224Z microcode : 0x830107f 2025-05-07T20:23:34.1783450Z cpu MHz : 3298.326 2025-05-07T20:23:34.1783666Z cache size : 512 KB 2025-05-07T20:23:34.1783876Z physical id : 0 2025-05-07T20:23:34.1784104Z siblings : 16 2025-05-07T20:23:34.1784339Z core id : 0 2025-05-07T20:23:34.1784534Z cpu cores : 8 2025-05-07T20:23:34.1784736Z apicid : 0 2025-05-07T20:23:34.1784935Z initial apicid : 0 2025-05-07T20:23:34.1785144Z fpu : yes 2025-05-07T20:23:34.1785343Z fpu_exception : yes 2025-05-07T20:23:34.1785560Z cpuid level : 13 2025-05-07T20:23:34.1785765Z wp : yes 2025-05-07T20:23:34.1787827Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1790008Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1790493Z bogomips : 5599.85 2025-05-07T20:23:34.1790716Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1790947Z clflush size : 64 2025-05-07T20:23:34.1791167Z cache_alignment : 64 2025-05-07T20:23:34.1791438Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1791759Z power management: 2025-05-07T20:23:34.1791902Z 2025-05-07T20:23:34.1791985Z processor : 1 2025-05-07T20:23:34.1792207Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1792439Z cpu family : 23 2025-05-07T20:23:34.1792654Z model : 49 2025-05-07T20:23:34.1792861Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1793096Z stepping : 0 2025-05-07T20:23:34.1793322Z microcode : 0x830107f 2025-05-07T20:23:34.1804126Z cpu MHz : 3307.447 2025-05-07T20:23:34.1804361Z cache size : 512 KB 2025-05-07T20:23:34.1804585Z physical id : 0 2025-05-07T20:23:34.1804801Z siblings : 16 2025-05-07T20:23:34.1804998Z core id : 1 2025-05-07T20:23:34.1805204Z cpu cores : 8 2025-05-07T20:23:34.1805408Z apicid : 2 2025-05-07T20:23:34.1805607Z initial apicid : 2 2025-05-07T20:23:34.1805828Z fpu : yes 2025-05-07T20:23:34.1806039Z fpu_exception : yes 2025-05-07T20:23:34.1806256Z cpuid level : 13 2025-05-07T20:23:34.1806469Z wp : yes 2025-05-07T20:23:34.1808386Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1810561Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1811046Z bogomips : 5599.85 2025-05-07T20:23:34.1811268Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1811510Z clflush size : 64 
2025-05-07T20:23:34.1811735Z cache_alignment : 64 2025-05-07T20:23:34.1812001Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1812321Z power management: 2025-05-07T20:23:34.1812454Z 2025-05-07T20:23:34.1812553Z processor : 2 2025-05-07T20:23:34.1812770Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1813018Z cpu family : 23 2025-05-07T20:23:34.1813236Z model : 49 2025-05-07T20:23:34.1813442Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1814064Z stepping : 0 2025-05-07T20:23:34.1814306Z microcode : 0x830107f 2025-05-07T20:23:34.1814557Z cpu MHz : 3301.782 2025-05-07T20:23:34.1814769Z cache size : 512 KB 2025-05-07T20:23:34.1814987Z physical id : 0 2025-05-07T20:23:34.1815206Z siblings : 16 2025-05-07T20:23:34.1815405Z core id : 2 2025-05-07T20:23:34.1815608Z cpu cores : 8 2025-05-07T20:23:34.1815813Z apicid : 4 2025-05-07T20:23:34.1816010Z initial apicid : 4 2025-05-07T20:23:34.1816229Z fpu : yes 2025-05-07T20:23:34.1816433Z fpu_exception : yes 2025-05-07T20:23:34.1816646Z cpuid level : 13 2025-05-07T20:23:34.1816859Z wp : yes 2025-05-07T20:23:34.1818894Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1821062Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1821538Z bogomips : 5599.85 2025-05-07T20:23:34.1821766Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1822008Z clflush size : 64 2025-05-07T20:23:34.1822223Z cache_alignment : 64 2025-05-07T20:23:34.1822496Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1822813Z power management: 2025-05-07T20:23:34.1822947Z 2025-05-07T20:23:34.1823044Z processor : 3 2025-05-07T20:23:34.1823259Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1823504Z cpu family : 23 2025-05-07T20:23:34.1823717Z model : 49 2025-05-07T20:23:34.1823920Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1824164Z stepping : 0 2025-05-07T20:23:34.1824380Z microcode : 0x830107f 2025-05-07T20:23:34.1824608Z cpu MHz : 3299.453 2025-05-07T20:23:34.1824828Z cache size : 512 KB 2025-05-07T20:23:34.1825045Z physical id : 0 2025-05-07T20:23:34.1825250Z siblings : 16 2025-05-07T20:23:34.1825452Z core id : 3 2025-05-07T20:23:34.1825656Z cpu cores : 8 2025-05-07T20:23:34.1825851Z apicid : 6 2025-05-07T20:23:34.1826051Z initial apicid : 6 2025-05-07T20:23:34.1826264Z fpu : yes 2025-05-07T20:23:34.1826458Z fpu_exception : yes 2025-05-07T20:23:34.1826677Z cpuid level : 13 2025-05-07T20:23:34.1826886Z wp : yes 2025-05-07T20:23:34.1828779Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1830934Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1831417Z bogomips : 5599.85 2025-05-07T20:23:34.1831646Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1831889Z clflush size : 64 2025-05-07T20:23:34.1832103Z cache_alignment : 64 2025-05-07T20:23:34.1832377Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1832694Z power management: 2025-05-07T20:23:34.1832824Z 2025-05-07T20:23:34.1832908Z processor : 4 2025-05-07T20:23:34.1833130Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1833371Z cpu family : 23 2025-05-07T20:23:34.1833574Z model : 49 2025-05-07T20:23:34.1833792Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1834037Z stepping : 0 2025-05-07T20:23:34.1834240Z microcode : 0x830107f 2025-05-07T20:23:34.1834471Z cpu MHz : 3296.391 2025-05-07T20:23:34.1834786Z cache size : 512 KB 2025-05-07T20:23:34.1835002Z physical id : 0 2025-05-07T20:23:34.1835214Z siblings : 16 2025-05-07T20:23:34.1835424Z core id : 4 2025-05-07T20:23:34.1835621Z cpu cores : 8 2025-05-07T20:23:34.1835824Z apicid : 8 2025-05-07T20:23:34.1836029Z initial apicid : 8 2025-05-07T20:23:34.1836237Z fpu : yes 2025-05-07T20:23:34.1836501Z fpu_exception : yes 2025-05-07T20:23:34.1836730Z cpuid level : 13 2025-05-07T20:23:34.1836943Z wp : yes 2025-05-07T20:23:34.1838907Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1841066Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1841544Z bogomips : 5599.85 2025-05-07T20:23:34.1841769Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1841993Z clflush size : 64 2025-05-07T20:23:34.1842210Z cache_alignment : 64 2025-05-07T20:23:34.1842476Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1842782Z power management: 2025-05-07T20:23:34.1842921Z 2025-05-07T20:23:34.1843004Z processor : 5 2025-05-07T20:23:34.1843217Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1843453Z cpu family : 23 2025-05-07T20:23:34.1843651Z model : 49 2025-05-07T20:23:34.1843853Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1844096Z stepping : 0 2025-05-07T20:23:34.1844296Z microcode : 0x830107f 2025-05-07T20:23:34.1844517Z cpu MHz : 3286.612 2025-05-07T20:23:34.1844729Z cache size : 512 KB 2025-05-07T20:23:34.1844935Z physical id : 0 2025-05-07T20:23:34.1845145Z siblings : 16 2025-05-07T20:23:34.1845343Z core id : 5 2025-05-07T20:23:34.1845532Z cpu cores : 8 2025-05-07T20:23:34.1845733Z apicid : 10 2025-05-07T20:23:34.1845936Z initial apicid : 10 2025-05-07T20:23:34.1846143Z fpu : yes 2025-05-07T20:23:34.1846343Z fpu_exception : yes 2025-05-07T20:23:34.1846565Z cpuid level : 13 2025-05-07T20:23:34.1846766Z wp : yes 2025-05-07T20:23:34.1848662Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1850817Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1851300Z bogomips : 5599.85 2025-05-07T20:23:34.1851512Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1851751Z clflush size : 64 2025-05-07T20:23:34.1851971Z cache_alignment : 64 2025-05-07T20:23:34.1852243Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1852548Z power management: 2025-05-07T20:23:34.1852687Z 2025-05-07T20:23:34.1852769Z processor : 6 2025-05-07T20:23:34.1852983Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1853212Z cpu family : 23 2025-05-07T20:23:34.1853422Z model : 49 2025-05-07T20:23:34.1853626Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1853939Z stepping : 0 2025-05-07T20:23:34.1854148Z microcode : 0x830107f 2025-05-07T20:23:34.1854376Z cpu MHz : 3290.744 2025-05-07T20:23:34.1854584Z cache size : 512 KB 2025-05-07T20:23:34.1854805Z physical id : 0 2025-05-07T20:23:34.1855019Z siblings : 16 2025-05-07T20:23:34.1855212Z core id : 6 2025-05-07T20:23:34.1855500Z cpu cores : 8 2025-05-07T20:23:34.1855701Z apicid : 12 2025-05-07T20:23:34.1855901Z initial apicid : 12 2025-05-07T20:23:34.1856111Z fpu : yes 2025-05-07T20:23:34.1856312Z fpu_exception : yes 2025-05-07T20:23:34.1856523Z cpuid level : 13 2025-05-07T20:23:34.1856733Z wp : yes 2025-05-07T20:23:34.1858735Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1861012Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1861501Z bogomips : 5599.85 2025-05-07T20:23:34.1861711Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1861947Z clflush size : 64 2025-05-07T20:23:34.1862160Z cache_alignment : 64 2025-05-07T20:23:34.1862418Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1862730Z power management: 2025-05-07T20:23:34.1862858Z 2025-05-07T20:23:34.1862947Z processor : 7 2025-05-07T20:23:34.1863153Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1863386Z cpu family : 23 2025-05-07T20:23:34.1863592Z model : 49 2025-05-07T20:23:34.1863808Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1864076Z stepping : 0 2025-05-07T20:23:34.1864282Z microcode : 0x830107f 2025-05-07T20:23:34.1864500Z cpu MHz : 3294.699 2025-05-07T20:23:34.1864722Z cache size : 512 KB 2025-05-07T20:23:34.1864939Z physical id : 0 2025-05-07T20:23:34.1865147Z siblings : 16 2025-05-07T20:23:34.1865351Z core id : 7 2025-05-07T20:23:34.1865550Z cpu cores : 8 2025-05-07T20:23:34.1865746Z apicid : 
14 2025-05-07T20:23:34.1865953Z initial apicid : 14 2025-05-07T20:23:34.1866161Z fpu : yes 2025-05-07T20:23:34.1866350Z fpu_exception : yes 2025-05-07T20:23:34.1866559Z cpuid level : 13 2025-05-07T20:23:34.1866764Z wp : yes 2025-05-07T20:23:34.1868653Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1870800Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1871277Z bogomips : 5599.85 2025-05-07T20:23:34.1871492Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1871724Z clflush size : 64 2025-05-07T20:23:34.1871939Z cache_alignment : 64 2025-05-07T20:23:34.1872196Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1872505Z power management: 2025-05-07T20:23:34.1872633Z 2025-05-07T20:23:34.1872723Z processor : 8 2025-05-07T20:23:34.1872927Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1873163Z cpu family : 23 2025-05-07T20:23:34.1873366Z model : 49 2025-05-07T20:23:34.1873564Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1873802Z stepping : 0 2025-05-07T20:23:34.1874012Z microcode : 0x830107f 2025-05-07T20:23:34.1874229Z cpu MHz : 3298.834 2025-05-07T20:23:34.1874440Z cache size : 512 KB 2025-05-07T20:23:34.1874651Z physical id : 0 2025-05-07T20:23:34.1874853Z siblings : 16 2025-05-07T20:23:34.1875052Z core id : 0 2025-05-07T20:23:34.1875251Z cpu cores : 8 2025-05-07T20:23:34.1875442Z apicid : 1 2025-05-07T20:23:34.1875637Z initial apicid : 1 2025-05-07T20:23:34.1875933Z fpu : yes 2025-05-07T20:23:34.1876121Z fpu_exception : yes 2025-05-07T20:23:34.1876334Z cpuid level : 13 2025-05-07T20:23:34.1876540Z wp : yes 2025-05-07T20:23:34.1878422Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1880655Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1881124Z bogomips : 5599.85 2025-05-07T20:23:34.1881343Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1881581Z clflush size : 64 2025-05-07T20:23:34.1881789Z cache_alignment : 64 2025-05-07T20:23:34.1882054Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1882361Z power management: 2025-05-07T20:23:34.1882487Z 2025-05-07T20:23:34.1882571Z processor : 9 2025-05-07T20:23:34.1882781Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1883013Z cpu family : 23 2025-05-07T20:23:34.1883209Z model : 49 2025-05-07T20:23:34.1883411Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1883645Z 
stepping : 0 2025-05-07T20:23:34.1883843Z microcode : 0x830107f 2025-05-07T20:23:34.1884069Z cpu MHz : 3283.912 2025-05-07T20:23:34.1884276Z cache size : 512 KB 2025-05-07T20:23:34.1884492Z physical id : 0 2025-05-07T20:23:34.1884691Z siblings : 16 2025-05-07T20:23:34.1884886Z core id : 1 2025-05-07T20:23:34.1885086Z cpu cores : 8 2025-05-07T20:23:34.1885281Z apicid : 3 2025-05-07T20:23:34.1885477Z initial apicid : 3 2025-05-07T20:23:34.1885690Z fpu : yes 2025-05-07T20:23:34.1885880Z fpu_exception : yes 2025-05-07T20:23:34.1886100Z cpuid level : 13 2025-05-07T20:23:34.1886305Z wp : yes 2025-05-07T20:23:34.1888185Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1890338Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1890820Z bogomips : 5599.85 2025-05-07T20:23:34.1891038Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1891266Z clflush size : 64 2025-05-07T20:23:34.1891480Z cache_alignment : 64 2025-05-07T20:23:34.1891750Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1892060Z power management: 2025-05-07T20:23:34.1892188Z 2025-05-07T20:23:34.1892274Z processor : 10 2025-05-07T20:23:34.1892487Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1892722Z cpu family : 23 2025-05-07T20:23:34.1892922Z model : 49 2025-05-07T20:23:34.1893126Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1893396Z stepping : 0 2025-05-07T20:23:34.1893607Z microcode : 0x830107f 2025-05-07T20:23:34.1893948Z cpu MHz : 3299.820 2025-05-07T20:23:34.1894171Z cache size : 512 KB 2025-05-07T20:23:34.1894379Z physical id : 0 2025-05-07T20:23:34.1894583Z siblings : 16 2025-05-07T20:23:34.1894774Z core id : 2 2025-05-07T20:23:34.1894972Z cpu cores : 8 2025-05-07T20:23:34.1895167Z apicid : 5 2025-05-07T20:23:34.1895360Z initial apicid : 5 2025-05-07T20:23:34.1895571Z fpu : yes 2025-05-07T20:23:34.1895767Z fpu_exception : yes 2025-05-07T20:23:34.1895974Z cpuid level : 13 2025-05-07T20:23:34.1896268Z wp : yes 2025-05-07T20:23:34.1898153Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1900721Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1901188Z bogomips : 5599.85 2025-05-07T20:23:34.1901554Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1901791Z clflush size : 64 2025-05-07T20:23:34.1901999Z cache_alignment : 64 2025-05-07T20:23:34.1902261Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:34.1902579Z power management: 2025-05-07T20:23:34.1902706Z 2025-05-07T20:23:34.1902796Z processor : 11 2025-05-07T20:23:34.1903002Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1903234Z cpu family : 23 2025-05-07T20:23:34.1903438Z model : 49 2025-05-07T20:23:34.1903636Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1903882Z stepping : 0 2025-05-07T20:23:34.1904089Z microcode : 0x830107f 2025-05-07T20:23:34.1904332Z cpu MHz : 3301.322 2025-05-07T20:23:34.1904569Z cache size : 512 KB 2025-05-07T20:23:34.1904780Z physical id : 0 2025-05-07T20:23:34.1904984Z siblings : 16 2025-05-07T20:23:34.1905185Z core id : 3 2025-05-07T20:23:34.1905381Z cpu cores : 8 2025-05-07T20:23:34.1905573Z apicid : 7 2025-05-07T20:23:34.1905769Z initial apicid : 7 2025-05-07T20:23:34.1905983Z fpu : yes 2025-05-07T20:23:34.1906174Z fpu_exception : yes 2025-05-07T20:23:34.1906391Z cpuid level : 13 2025-05-07T20:23:34.1906598Z wp : yes 2025-05-07T20:23:34.1908484Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1910633Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1911110Z bogomips : 5599.85 2025-05-07T20:23:34.1911328Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1911567Z clflush size : 64 2025-05-07T20:23:34.1911775Z cache_alignment : 64 2025-05-07T20:23:34.1912039Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1912348Z power management: 2025-05-07T20:23:34.1912480Z 2025-05-07T20:23:34.1912562Z processor : 12 2025-05-07T20:23:34.1912776Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1913010Z cpu family : 23 2025-05-07T20:23:34.1913207Z model : 49 2025-05-07T20:23:34.1913415Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1913653Z stepping : 0 2025-05-07T20:23:34.1913876Z microcode : 0x830107f 2025-05-07T20:23:34.1914123Z cpu MHz : 3300.495 2025-05-07T20:23:34.1914334Z cache size : 512 KB 2025-05-07T20:23:34.1914541Z physical id : 0 2025-05-07T20:23:34.1914747Z siblings : 16 2025-05-07T20:23:34.1914947Z core id : 4 2025-05-07T20:23:34.1915137Z cpu cores : 8 2025-05-07T20:23:34.1915336Z apicid : 9 2025-05-07T20:23:34.1915536Z initial apicid : 9 2025-05-07T20:23:34.1915741Z fpu : yes 2025-05-07T20:23:34.1915941Z fpu_exception : yes 2025-05-07T20:23:34.1916160Z cpuid level : 13 2025-05-07T20:23:34.1916358Z wp : yes 2025-05-07T20:23:34.1918241Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:34.1920523Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1921001Z bogomips : 5599.85 2025-05-07T20:23:34.1921217Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1921450Z clflush size : 64 2025-05-07T20:23:34.1921669Z cache_alignment : 64 2025-05-07T20:23:34.1922022Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1922328Z power management: 2025-05-07T20:23:34.1922461Z 2025-05-07T20:23:34.1922546Z processor : 13 2025-05-07T20:23:34.1922770Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1923001Z cpu family : 23 2025-05-07T20:23:34.1923212Z model : 49 2025-05-07T20:23:34.1923418Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1923655Z stepping : 0 2025-05-07T20:23:34.1923891Z microcode : 0x830107f 2025-05-07T20:23:34.1924141Z cpu MHz : 3292.898 2025-05-07T20:23:34.1924359Z cache size : 512 KB 2025-05-07T20:23:34.1924565Z physical id : 0 2025-05-07T20:23:34.1924771Z siblings : 16 2025-05-07T20:23:34.1924972Z core id : 5 2025-05-07T20:23:34.1925167Z cpu cores : 8 2025-05-07T20:23:34.1925370Z apicid : 11 2025-05-07T20:23:34.1925578Z initial apicid : 11 2025-05-07T20:23:34.1925783Z fpu : yes 2025-05-07T20:23:34.1925980Z fpu_exception : yes 2025-05-07T20:23:34.1926196Z cpuid level : 13 2025-05-07T20:23:34.1926393Z wp : yes 2025-05-07T20:23:34.1928289Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1930443Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1930920Z bogomips : 5599.85 2025-05-07T20:23:34.1931132Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1931365Z clflush size : 64 2025-05-07T20:23:34.1931581Z cache_alignment : 64 2025-05-07T20:23:34.1931840Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1932152Z power management: 2025-05-07T20:23:34.1932290Z 2025-05-07T20:23:34.1932373Z processor : 14 2025-05-07T20:23:34.1932588Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1932814Z cpu family : 23 2025-05-07T20:23:34.1933020Z model : 49 2025-05-07T20:23:34.1933225Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1933457Z stepping : 0 2025-05-07T20:23:34.1933774Z microcode : 0x830107f 2025-05-07T20:23:34.1933998Z cpu MHz : 3299.445 2025-05-07T20:23:34.1934203Z cache size : 512 KB 2025-05-07T20:23:34.1934415Z physical id : 0 2025-05-07T20:23:34.1934618Z siblings : 16 2025-05-07T20:23:34.1934808Z core id : 6 2025-05-07T20:23:34.1935005Z cpu cores : 8 2025-05-07T20:23:34.1935203Z apicid : 13 2025-05-07T20:23:34.1935403Z initial apicid : 13 2025-05-07T20:23:34.1935614Z fpu : yes 2025-05-07T20:23:34.1935813Z fpu_exception : yes 2025-05-07T20:23:34.1936021Z cpuid level : 13 2025-05-07T20:23:34.1936225Z wp : yes 2025-05-07T20:23:34.1938115Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1940379Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1940859Z bogomips : 5599.85 2025-05-07T20:23:34.1941071Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1941304Z clflush size : 64 2025-05-07T20:23:34.1941519Z cache_alignment : 64 2025-05-07T20:23:34.1941776Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1942087Z power management: 2025-05-07T20:23:34.1942216Z 2025-05-07T20:23:34.1942390Z processor : 15 2025-05-07T20:23:34.1942600Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1942835Z cpu family : 23 2025-05-07T20:23:34.1943037Z model : 49 2025-05-07T20:23:34.1943245Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1943478Z stepping : 0 2025-05-07T20:23:34.1951115Z microcode : 0x830107f 2025-05-07T20:23:34.1951351Z cpu MHz : 3295.983 2025-05-07T20:23:34.1951568Z cache size : 512 KB 2025-05-07T20:23:34.1951782Z physical id : 0 2025-05-07T20:23:34.1951984Z siblings : 16 2025-05-07T20:23:34.1952183Z core id : 7 2025-05-07T20:23:34.1952383Z cpu cores : 8 2025-05-07T20:23:34.1952577Z apicid : 15 2025-05-07T20:23:34.1952778Z initial apicid : 15 2025-05-07T20:23:34.1952990Z fpu : yes 2025-05-07T20:23:34.1953183Z fpu_exception : yes 2025-05-07T20:23:34.1953398Z cpuid level : 13 2025-05-07T20:23:34.1953604Z wp : yes 2025-05-07T20:23:34.1955549Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1957708Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1958186Z bogomips : 5599.85 2025-05-07T20:23:34.1958409Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1958645Z clflush size : 64 2025-05-07T20:23:34.1958854Z cache_alignment : 64 2025-05-07T20:23:34.1959122Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1959433Z power management: 2025-05-07T20:23:34.1959559Z 2025-05-07T20:23:34.1959564Z 2025-05-07T20:23:34.1959682Z ################################################################################ 2025-05-07T20:23:34.1959988Z [INFO] Print PCI info ... 2025-05-07T20:23:34.1960231Z + lspci -v 2025-05-07T20:23:34.1960342Z 2025-05-07T20:23:34.1960561Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:34.1960934Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:34.1961253Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:34.1961460Z 2025-05-07T20:23:34.1961650Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:34.1962020Z Physical Slot: 1 2025-05-07T20:23:34.1962251Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:34.1962453Z 2025-05-07T20:23:34.1962695Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:34.1963117Z Physical Slot: 1 2025-05-07T20:23:34.1963363Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:34.1963586Z 2025-05-07T20:23:34.1963844Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:34.1964317Z Physical Slot: 3 2025-05-07T20:23:34.1964565Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:34.1965010Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:34.1965359Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:34.1965575Z 2025-05-07T20:23:34.1965873Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:34.1966366Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:34.1966647Z Physical Slot: 4 2025-05-07T20:23:34.1966898Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:34.1967268Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:34.1967610Z Capabilities: 2025-05-07T20:23:34.1967884Z Kernel driver in use: nvme 2025-05-07T20:23:34.1968045Z 2025-05-07T20:23:34.1968348Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:34.1968809Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:34.1969148Z Physical Slot: 5 2025-05-07T20:23:34.1969390Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:34.1969740Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:34.1970113Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:34.1970436Z Capabilities: 2025-05-07T20:23:34.1970696Z Kernel driver in use: ena 2025-05-07T20:23:34.1970930Z Kernel modules: ena 2025-05-07T20:23:34.1971072Z 2025-05-07T20:23:34.1971236Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:34.1971605Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:34.1971887Z Physical Slot: 30 2025-05-07T20:23:34.1972144Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:34.1972509Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:34.1972883Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:34.1973246Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:34.1973570Z Capabilities: 2025-05-07T20:23:34.1973962Z Kernel driver in use: nvidia 2025-05-07T20:23:34.1974241Z Kernel modules: nvidia 2025-05-07T20:23:34.1974388Z 2025-05-07T20:23:34.1974679Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:34.1975177Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:34.1975454Z Physical Slot: 31 2025-05-07T20:23:34.1975695Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:34.1976041Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:34.1976410Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:34.1976735Z Capabilities: 2025-05-07T20:23:34.1976992Z Kernel driver in use: nvme 2025-05-07T20:23:34.1977147Z 2025-05-07T20:23:34.1977151Z 2025-05-07T20:23:34.1977273Z ################################################################################ 2025-05-07T20:23:34.1977589Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:34.1977868Z + uname -a 2025-05-07T20:23:34.1977983Z 2025-05-07T20:23:34.1978383Z Linux ip-10-0-14-174.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:34.1978864Z 2025-05-07T20:23:34.1978945Z + uname -m 2025-05-07T20:23:34.1979059Z 2025-05-07T20:23:34.1979132Z x86_64 2025-05-07T20:23:34.1979239Z 2025-05-07T20:23:34.1979325Z + cat /proc/version 2025-05-07T20:23:34.1979452Z 2025-05-07T20:23:34.1979978Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:34.1980580Z 2025-05-07T20:23:34.1980669Z + cat /etc/os-release 2025-05-07T20:23:34.1980817Z 2025-05-07T20:23:34.1980907Z NAME="Amazon Linux" 2025-05-07T20:23:34.1981112Z VERSION="2023" 2025-05-07T20:23:34.1981312Z ID="amzn" 2025-05-07T20:23:34.1981493Z ID_LIKE="fedora" 2025-05-07T20:23:34.1981701Z VERSION_ID="2023" 2025-05-07T20:23:34.1982021Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:34.1982291Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:34.1982567Z ANSI_COLOR="0;33" 2025-05-07T20:23:34.1982812Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:34.1983191Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:34.1983611Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:34.1984017Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:34.1984473Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:34.1984856Z VENDOR_NAME="AWS" 2025-05-07T20:23:34.1985092Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:34.1985376Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:34.1985525Z 2025-05-07T20:23:34.1985726Z ################################################################################ 2025-05-07T20:23:34.1986029Z # Print EC2 Instance Info 2025-05-07T20:23:34.1986263Z # 2025-05-07T20:23:34.1986475Z # [2025-05-07T20:23:34.197Z] + print_ec2_info 2025-05-07T20:23:34.1986793Z ################################################################################ 2025-05-07T20:23:34.1987000Z 2025-05-07T20:23:34.2096670Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:34.2212506Z instance-id: i-0b68a33264ad7b273 2025-05-07T20:23:34.2335605Z instance-type: g5.4xlarge 2025-05-07T20:23:34.2373358Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:34.2373858Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:34.2383649Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:34.2384049Z env: 2025-05-07T20:23:34.2384273Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:34.2384572Z BUILD_ENV: build_binary 2025-05-07T20:23:34.2384815Z BUILD_TARGET: genai 2025-05-07T20:23:34.2385044Z BUILD_VARIANT: cuda 2025-05-07T20:23:34.2385281Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:34.2385535Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:34.2385833Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:34.2386165Z ##[endgroup] 2025-05-07T20:23:34.5753219Z ################################################################################ 2025-05-07T20:23:34.5753619Z [INFO] Printing general display info ... 2025-05-07T20:23:34.5767585Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:34.6662871Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:34.6671576Z /usr/bin/sudo 2025-05-07T20:23:34.6682526Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:34.6693505Z /usr/bin/yum 2025-05-07T20:23:34.6695319Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:34.6716868Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:35.0822361Z Last metadata expiration check: 0:00:07 ago on Wed May 7 20:23:28 2025. 2025-05-07T20:23:35.1731342Z ================================================================================ 2025-05-07T20:23:35.1731803Z WARNING: 2025-05-07T20:23:35.1732075Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:35.1732311Z 2025-05-07T20:23:35.1732402Z Available Versions: 2025-05-07T20:23:35.1732549Z 2025-05-07T20:23:35.1732646Z Version 2023.7.20250331: 2025-05-07T20:23:35.1732948Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:35.1733207Z 2025-05-07T20:23:35.1733341Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:35.1733554Z 2025-05-07T20:23:35.1733727Z Release notes: 2025-05-07T20:23:35.1734127Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:35.1734489Z 2025-05-07T20:23:35.1734598Z Version 2023.7.20250414: 2025-05-07T20:23:35.1734896Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:35.1735146Z 2025-05-07T20:23:35.1735261Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:35.1735466Z 2025-05-07T20:23:35.1735559Z Release notes: 2025-05-07T20:23:35.1735940Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:35.1736585Z 2025-05-07T20:23:35.1736673Z Version 2023.7.20250428: 2025-05-07T20:23:35.1736974Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:35.1737214Z 2025-05-07T20:23:35.1737330Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:35.1737538Z 2025-05-07T20:23:35.1737622Z Release notes: 2025-05-07T20:23:35.1738003Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:35.1738355Z 2025-05-07T20:23:35.1738471Z ================================================================================ 2025-05-07T20:23:35.2892018Z Dependencies resolved. 
2025-05-07T20:23:35.3175140Z ================================================================================ 2025-05-07T20:23:35.3175948Z Package Arch Version Repository Size 2025-05-07T20:23:35.3176706Z ================================================================================ 2025-05-07T20:23:35.3177313Z Upgrading: 2025-05-07T20:23:35.3178010Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:35.3179152Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:35.3179839Z 2025-05-07T20:23:35.3180434Z Transaction Summary 2025-05-07T20:23:35.3180936Z ================================================================================ 2025-05-07T20:23:35.3181535Z Upgrade 2 Packages 2025-05-07T20:23:35.3181799Z 2025-05-07T20:23:35.3182011Z Total download size: 6.9 M 2025-05-07T20:23:35.3182504Z Downloading Packages: 2025-05-07T20:23:35.3566222Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 33 MB/s | 1.2 MB 00:00 2025-05-07T20:23:35.4234277Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 54 MB/s | 5.7 MB 00:00 2025-05-07T20:23:35.4245265Z -------------------------------------------------------------------------------- 2025-05-07T20:23:35.4246432Z Total 65 MB/s | 6.9 MB 00:00 2025-05-07T20:23:35.4249693Z Running transaction check 2025-05-07T20:23:35.4355582Z Transaction check succeeded. 2025-05-07T20:23:35.4356022Z Running transaction test 2025-05-07T20:23:35.4651640Z Transaction test succeeded. 2025-05-07T20:23:35.4654518Z Running transaction 2025-05-07T20:23:36.0205109Z Preparing : 1/1 2025-05-07T20:23:36.1257857Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:36.1281633Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:36.1495874Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:36.1496605Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:36.1603832Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:36.1630739Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:36.3216991Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:36.3217550Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:36.3218098Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:36.3218626Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:36.4607830Z ================================================================================ 2025-05-07T20:23:36.4608202Z WARNING: 2025-05-07T20:23:36.4608455Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:36.4608682Z 2025-05-07T20:23:36.4608783Z Available Versions: 2025-05-07T20:23:36.4608929Z 2025-05-07T20:23:36.4609035Z Version 2023.7.20250331: 2025-05-07T20:23:36.4609341Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:36.4609906Z 2025-05-07T20:23:36.4610034Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:36.4610241Z 2025-05-07T20:23:36.4610331Z Release notes: 2025-05-07T20:23:36.4610734Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:36.4611098Z 2025-05-07T20:23:36.4611205Z Version 2023.7.20250414: 2025-05-07T20:23:36.4611504Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:36.4611747Z 2025-05-07T20:23:36.4611871Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:36.4612076Z 2025-05-07T20:23:36.4612162Z Release notes: 2025-05-07T20:23:36.4612549Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:36.4612902Z 2025-05-07T20:23:36.4613000Z Version 2023.7.20250428: 2025-05-07T20:23:36.4613295Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:36.4613546Z 2025-05-07T20:23:36.4613804Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:36.4614017Z 2025-05-07T20:23:36.4614104Z Release notes: 2025-05-07T20:23:36.4614489Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:36.4614839Z 2025-05-07T20:23:36.4615192Z ================================================================================ 2025-05-07T20:23:36.5173541Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:36.5174202Z 2025-05-07T20:23:36.5174326Z Upgraded: 2025-05-07T20:23:36.5174862Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:36.5175875Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:36.5176490Z 2025-05-07T20:23:36.5176604Z Complete! 2025-05-07T20:23:36.5618537Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:36.5641850Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:37.0614577Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:28 2025. 2025-05-07T20:23:37.0853278Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:37.0858299Z Package lshw-B.02.19.2-7.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:37.1254597Z Dependencies resolved. 2025-05-07T20:23:37.1437684Z Nothing to do. 2025-05-07T20:23:37.1438449Z Complete! 2025-05-07T20:23:37.1830519Z + hostname 2025-05-07T20:23:37.1830672Z 2025-05-07T20:23:37.1845288Z ip-10-0-14-174.ec2.internal 2025-05-07T20:23:37.1846856Z 2025-05-07T20:23:37.1847210Z + sudo lshw -C display 2025-05-07T20:23:37.1847374Z 2025-05-07T20:23:37.4575501Z *-display:0 UNCLAIMED 2025-05-07T20:23:37.4575836Z description: VGA compatible controller 2025-05-07T20:23:37.4576160Z product: Amazon.com, Inc. 2025-05-07T20:23:37.4576439Z vendor: Amazon.com, Inc. 
2025-05-07T20:23:37.4576697Z physical id: 3 2025-05-07T20:23:37.4576931Z bus info: pci@0000:00:03.0 2025-05-07T20:23:37.4577219Z version: 00 2025-05-07T20:23:37.4577435Z width: 32 bits 2025-05-07T20:23:37.4577653Z clock: 33MHz 2025-05-07T20:23:37.4577904Z capabilities: vga_controller bus_master 2025-05-07T20:23:37.4578220Z configuration: latency=0 2025-05-07T20:23:37.4578555Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:37.4578881Z *-display:1 2025-05-07T20:23:37.4579105Z description: 3D controller 2025-05-07T20:23:37.4579387Z product: GA102GL [A10G] 2025-05-07T20:23:37.4579647Z vendor: NVIDIA Corporation 2025-05-07T20:23:37.4579916Z physical id: 1e 2025-05-07T20:23:37.4580155Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:37.4580405Z version: a1 2025-05-07T20:23:37.4580620Z width: 64 bits 2025-05-07T20:23:37.4580839Z clock: 33MHz 2025-05-07T20:23:37.4581118Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:37.4581487Z configuration: driver=nvidia latency=0 2025-05-07T20:23:37.4582408Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:37.4616201Z 2025-05-07T20:23:37.4616565Z ################################################################################ 2025-05-07T20:23:37.4616928Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:37.4744061Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:37.4927840Z Wed May 7 20:23:37 2025 2025-05-07T20:23:37.4928193Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:37.4928697Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:37.4929170Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:37.4929643Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:37.4930174Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:37.4930589Z | | | MIG M. | 2025-05-07T20:23:37.4931205Z |=========================================+========================+======================| 2025-05-07T20:23:37.5062247Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:37.5062686Z | 0% 29C P8 24W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:37.5063080Z | | | N/A | 2025-05-07T20:23:37.5063459Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:37.5067439Z 2025-05-07T20:23:37.5068217Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:37.5069057Z | Processes: | 2025-05-07T20:23:37.5069894Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:37.5070707Z | ID ID Usage | 2025-05-07T20:23:37.5071370Z |=========================================================================================| 2025-05-07T20:23:37.5072203Z | No running processes found | 2025-05-07T20:23:37.5073098Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:37.7710964Z ################################################################################ 2025-05-07T20:23:37.7711295Z [INFO] Printing AMD GPU info ... 
2025-05-07T20:23:37.7865776Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:37.7866233Z [CHECK] rocminfo not found 2025-05-07T20:23:37.7866932Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:37.7868734Z [CHECK] rocm-smi not found 2025-05-07T20:23:37.7905250Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:37.7905672Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:37.7917729Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:37.7918074Z env: 2025-05-07T20:23:37.7918298Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:37.7918587Z BUILD_ENV: build_binary 2025-05-07T20:23:37.7918829Z BUILD_TARGET: genai 2025-05-07T20:23:37.7919058Z BUILD_VARIANT: cuda 2025-05-07T20:23:37.7919284Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:37.7919536Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:37.7919835Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:37.7920155Z ##[endgroup] 2025-05-07T20:23:38.1282392Z ################################################################################ 2025-05-07T20:23:38.1282761Z # Setup Miniconda 2025-05-07T20:23:38.1282971Z # 2025-05-07T20:23:38.1297625Z # [2025-05-07T20:23:38.129Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:38.1298036Z ################################################################################ 2025-05-07T20:23:38.1298519Z 2025-05-07T20:23:38.1314129Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:38.2223328Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:38.2223837Z [SETUP] A Miniconda installation appears to already exist in /home/ec2-user/miniconda ... 2025-05-07T20:23:38.2224377Z [SETUP] Clearing out directory: /home/ec2-user/miniconda ... 2025-05-07T20:23:38.2224739Z + rm -rf /home/ec2-user/miniconda 2025-05-07T20:23:38.2224927Z 2025-05-07T20:23:43.2506857Z 2025-05-07T20:23:43.2507454Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:43.2507718Z 2025-05-07T20:23:43.2523485Z 2025-05-07T20:23:43.2523806Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:43.2546619Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:44.2614384Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:44.2614764Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:44.2615011Z 2025-05-07T20:23:44.2761383Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:44.7255795Z Unpacking payload ... 2025-05-07T20:23:45.2427737Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:46.0462660Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:48.1745781Z 2025-05-07T20:23:48.1746161Z Installing base environment... 2025-05-07T20:23:48.1746397Z 2025-05-07T20:23:49.2539381Z Preparing transaction: ...working... done 2025-05-07T20:23:52.2639236Z Executing transaction: ...working... done 2025-05-07T20:23:52.9206845Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 
2025-05-07T20:23:53.0102485Z installation finished.
2025-05-07T20:23:53.0109994Z + rm -f miniconda.sh
2025-05-07T20:23:53.0426163Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:23:53.0426510Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:53.4143991Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:53.4144484Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:53.4144833Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:53.4145213Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:53.4145565Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:53.4146365Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:53.4146797Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:53.4147228Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:53.4147677Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:53.4148202Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:53.4148714Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:53.4149067Z no change /home/ec2-user/.bashrc
2025-05-07T20:23:53.4149329Z No action taken.
2025-05-07T20:23:53.4806295Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:54.3260480Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:54.3284535Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:07.7679165Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:09.3455567Z Solving environment: done
2025-05-07T20:24:09.4427513Z ## Package Plan ##
2025-05-07T20:24:09.4427905Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:09.4428242Z added / updated specs:
2025-05-07T20:24:09.4428502Z - conda-libmamba-solver
2025-05-07T20:24:09.4428763Z - libarchive
2025-05-07T20:24:09.4428969Z - libmamba
2025-05-07T20:24:09.4429171Z - libmambapy
2025-05-07T20:24:09.4429418Z The following packages will be downloaded:
2025-05-07T20:24:09.4429744Z package | build
2025-05-07T20:24:09.4430062Z ---------------------------|-----------------
2025-05-07T20:24:09.4430466Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:24:09.4430926Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:24:09.4431348Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:24:09.4431817Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:24:09.4432257Z ------------------------------------------------------------
2025-05-07T20:24:09.4432600Z Total: 1.4 MB
2025-05-07T20:24:09.4432921Z The following packages will be UPDATED:
2025-05-07T20:24:09.4436557Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:09.4437332Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:09.4437925Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:09.4438543Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:09.4439330Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:09.4440201Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:09.5276451Z conda-libmamba-solve | 41 KB | ########## | 100%
2025-05-07T20:24:09.5410543Z ca-certificates-2025 | 149 KB | ########## | 100%
2025-05-07T20:24:09.5830393Z certifi-2025.4.26 | 154 KB | ########## | 100%
2025-05-07T20:24:09.6910662Z conda-25.3.1 | 1.1 MB | ########## | 100%
2025-05-07T20:24:09.6912526Z done
2025-05-07T20:24:09.7915329Z Preparing transaction: done
2025-05-07T20:24:09.8918284Z Verifying transaction: done
2025-05-07T20:24:11.2940424Z Executing transaction: done
2025-05-07T20:24:13.2107222Z [SETUP] Updating Miniconda base packages ...
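[NOTE] The Miniconda bootstrap logged above follows the standard unattended-install pattern; a condensed sketch using only the commands the log itself shows (-b runs the installer without prompts, -p sets the install prefix, -u allows updating an existing prefix):

    PREFIX="$HOME/miniconda"
    rm -rf "$PREFIX" && mkdir -p "$PREFIX"
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$PREFIX" -u
    rm -f miniconda.sh
    "$PREFIX/bin/conda" init bash && . "$HOME/.bashrc"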
2025-05-07T20:24:13.2132528Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:14.0421571Z Channels:
2025-05-07T20:24:14.0421893Z - defaults
2025-05-07T20:24:14.0422166Z Platform: linux-64
2025-05-07T20:24:15.2828299Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:15.4054267Z Solving environment: done
2025-05-07T20:24:15.4055080Z Channels:
2025-05-07T20:24:15.4055388Z - defaults
2025-05-07T20:24:15.4055388Z Platform: linux-64
2025-05-07T20:24:15.7003025Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:15.9160214Z Solving environment: done
2025-05-07T20:24:16.0587756Z ## Package Plan ##
2025-05-07T20:24:16.0588070Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:16.0588395Z added / updated specs:
2025-05-07T20:24:16.0588631Z - conda
2025-05-07T20:24:16.0588894Z The following packages will be downloaded:
2025-05-07T20:24:16.0589213Z package | build
2025-05-07T20:24:16.0589767Z ---------------------------|-----------------
2025-05-07T20:24:16.0590117Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:24:16.0590502Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:24:16.0590866Z ------------------------------------------------------------
2025-05-07T20:24:16.0591201Z Total: 1.4 MB
2025-05-07T20:24:16.0591522Z The following packages will be UPDATED:
2025-05-07T20:24:16.0592017Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:16.0592510Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:16.0593067Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:16.3212646Z tzdata-2025b | 116 KB | ########## | 100%
2025-05-07T20:24:16.3217190Z pip-25.1 | 1.3 MB | ########## | 100%
2025-05-07T20:24:16.3218196Z done
2025-05-07T20:24:16.4220799Z Preparing transaction: done
2025-05-07T20:24:16.5226201Z Verifying transaction: done
2025-05-07T20:24:18.6252314Z Executing transaction: done
2025-05-07T20:24:19.2623347Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:19.2627791Z + conda clean --packages --tarball -y
2025-05-07T20:24:20.2691291Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:20.2691692Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:20.3352125Z + conda clean --all -y
2025-05-07T20:24:20.8870262Z There are no unused tarball(s) to remove.
2025-05-07T20:24:20.8870672Z Will remove 1 index cache(s).
2025-05-07T20:24:20.8870964Z There are no unused package(s) to remove.
2025-05-07T20:24:20.8871317Z There are no tempfile(s) to remove. 2025-05-07T20:24:20.8871616Z There are no logfile(s) to remove. 2025-05-07T20:24:20.9520178Z 2025-05-07T20:24:20.9524736Z + conda info 2025-05-07T20:24:20.9525180Z 2025-05-07T20:24:21.7073130Z 2025-05-07T20:24:21.7073794Z active environment : base 2025-05-07T20:24:21.7074267Z active env location : /home/ec2-user/miniconda 2025-05-07T20:24:21.7074696Z shell level : 1 2025-05-07T20:24:21.7075046Z user config file : /home/ec2-user/.condarc 2025-05-07T20:24:21.7075558Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:24:21.7076040Z conda version : 25.3.1 2025-05-07T20:24:21.7076366Z conda-build version : not installed 2025-05-07T20:24:21.7076676Z python version : 3.13.2.final.0 2025-05-07T20:24:21.7077090Z solver : libmamba (default) 2025-05-07T20:24:21.7077497Z virtual packages : __archspec=1=zen2 2025-05-07T20:24:21.7077900Z __conda=25.3.1=0 2025-05-07T20:24:21.7078284Z __cuda=12.8=0 2025-05-07T20:24:21.7078769Z __glibc=2.34=0 2025-05-07T20:24:21.7079043Z __linux=6.1.130=0 2025-05-07T20:24:21.7079672Z __unix=0=0 2025-05-07T20:24:21.7080010Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:24:21.7080404Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:24:21.7080754Z conda av metadata url : None 2025-05-07T20:24:21.7081124Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:24:21.7081561Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:24:21.7081932Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:24:21.7082307Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:24:21.7082668Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:24:21.7083148Z /home/ec2-user/.conda/pkgs 2025-05-07T20:24:21.7083484Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:24:21.7083815Z /home/ec2-user/.conda/envs 2025-05-07T20:24:21.7084117Z platform : linux-64 2025-05-07T20:24:21.7084927Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:24:21.7085737Z UID:GID : 1000:1000 2025-05-07T20:24:21.7086012Z netrc file : None 2025-05-07T20:24:21.7086261Z offline mode : False 2025-05-07T20:24:21.7086430Z 2025-05-07T20:24:21.7756740Z 2025-05-07T20:24:21.7757178Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:24:21.7758126Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_2db73161-039f-40ef-95f2-00b3cc70e8e1 ... 2025-05-07T20:24:21.7758917Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:24:21.7840890Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.13 2025-05-07T20:24:21.7841373Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.13 2025-05-07T20:24:21.7857411Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:24:21.7857794Z env: 2025-05-07T20:24:21.7858015Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:24:21.7858309Z BUILD_ENV: build_binary 2025-05-07T20:24:21.7858550Z BUILD_TARGET: genai 2025-05-07T20:24:21.7858782Z BUILD_VARIANT: cuda 2025-05-07T20:24:21.7859010Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:24:21.7859251Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:24:21.7859551Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:24:21.7859876Z ##[endgroup] 2025-05-07T20:24:22.1236529Z ################################################################################ 2025-05-07T20:24:22.1237063Z # Create Conda Environment 2025-05-07T20:24:22.1237453Z # 2025-05-07T20:24:22.1251820Z # [2025-05-07T20:24:22.124Z] + create_conda_environment build_binary 3.13 2025-05-07T20:24:22.1252314Z ################################################################################ 2025-05-07T20:24:22.1252539Z 2025-05-07T20:24:22.1266970Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:24:22.2202418Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:24:22.2202778Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:24:22.2203106Z + conda info --envs 2025-05-07T20:24:22.2203251Z 2025-05-07T20:24:22.9727630Z 2025-05-07T20:24:22.9728172Z # conda environments: 2025-05-07T20:24:22.9728451Z # 2025-05-07T20:24:22.9728672Z base /home/ec2-user/miniconda 2025-05-07T20:24:22.9728889Z 2025-05-07T20:24:23.0406364Z 2025-05-07T20:24:23.0407268Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:24:24.6984927Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:24.6985234Z 2025-05-07T20:24:24.6998487Z 2025-05-07T20:24:24.7007927Z [SETUP] Creating new Conda environment (Python 3.13) ... 
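[NOTE] Every network-bound step in this log is wrapped in a retry harness (the recurring "[EXEC] [ATTEMPT 0/3]" prefix). The wrapper below is a hypothetical reconstruction: only the attempt-count convention and the echoed command format come from the log; the function name and the backoff are assumptions.

    # Hypothetical retry helper mirroring the "[EXEC] [ATTEMPT i/3]" lines.
    exec_with_retries () {
      local max_attempts=3
      for ((i = 0; i < max_attempts; i++)); do
        echo "[EXEC] [ATTEMPT $i/$max_attempts] + $*"
        "$@" && return 0
        sleep $((2 ** i))    # assumed backoff; the log does not show one
      done
      echo "[EXEC] Command failed after $max_attempts attempts: $*" >&2
      return 1
    }
    exec_with_retries conda create -y -n build_binary python=3.13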
2025-05-07T20:24:24.7031020Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.13 2025-05-07T20:24:25.4590957Z Channels: 2025-05-07T20:24:25.4591202Z - defaults 2025-05-07T20:24:25.4591852Z Platform: linux-64 2025-05-07T20:24:27.0486725Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | done 2025-05-07T20:24:27.1490922Z Solving environment: - done 2025-05-07T20:24:27.1781593Z 2025-05-07T20:24:27.1781858Z ## Package Plan ## 2025-05-07T20:24:27.1782020Z 2025-05-07T20:24:27.1782247Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:27.1782594Z 2025-05-07T20:24:27.1782714Z added / updated specs: 2025-05-07T20:24:27.1782960Z - python=3.13 2025-05-07T20:24:27.1783092Z 2025-05-07T20:24:27.1783096Z 2025-05-07T20:24:27.1783214Z The following packages will be downloaded: 2025-05-07T20:24:27.1783856Z 2025-05-07T20:24:27.1783983Z package | build 2025-05-07T20:24:27.1784301Z ---------------------------|----------------- 2025-05-07T20:24:27.1784658Z _libgcc_mutex-0.1 | main 3 KB 2025-05-07T20:24:27.1785050Z _openmp_mutex-5.1 | 1_gnu 21 KB 2025-05-07T20:24:27.1785458Z ca-certificates-2025.2.25 | h06a4308_0 129 KB 2025-05-07T20:24:27.1785866Z python_abi-3.13 | 0_cp313 6 KB 2025-05-07T20:24:27.1786223Z ------------------------------------------------------------ 2025-05-07T20:24:27.1786558Z Total: 159 KB 2025-05-07T20:24:27.1786762Z 2025-05-07T20:24:27.1786893Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:27.1787107Z 2025-05-07T20:24:27.1787306Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main 2025-05-07T20:24:27.1787744Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 2025-05-07T20:24:27.1788374Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6 2025-05-07T20:24:27.1788850Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0 2025-05-07T20:24:27.1789317Z expat pkgs/main/linux-64::expat-2.7.1-h6a678d5_0 2025-05-07T20:24:27.1789755Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0 2025-05-07T20:24:27.1790201Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1 2025-05-07T20:24:27.1790623Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 2025-05-07T20:24:27.1791045Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 2025-05-07T20:24:27.1791468Z libmpdec pkgs/main/linux-64::libmpdec-4.0.0-h5eee18b_0 2025-05-07T20:24:27.1791919Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 2025-05-07T20:24:27.1792364Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0 2025-05-07T20:24:27.1792778Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0 2025-05-07T20:24:27.1793190Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0 2025-05-07T20:24:27.1793588Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2 2025-05-07T20:24:27.1793997Z python pkgs/main/linux-64::python-3.13.2-hf623796_100_cp313 2025-05-07T20:24:27.1794425Z python_abi pkgs/main/linux-64::python_abi-3.13-0_cp313 2025-05-07T20:24:27.1794844Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0 2025-05-07T20:24:27.1795305Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py313h06a4308_0 2025-05-07T20:24:27.1795758Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0 2025-05-07T20:24:27.1796140Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0 2025-05-07T20:24:27.1796511Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0 2025-05-07T20:24:27.1796924Z wheel pkgs/main/linux-64::wheel-0.45.1-py313h06a4308_0 2025-05-07T20:24:27.1797307Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1 2025-05-07T20:24:27.1797669Z zlib 
pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:27.1798052Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:27.2413840Z python_abi-3.13 | 6 KB | ########## | 100%
2025-05-07T20:24:27.2430193Z ca-certificates-2025 | 129 KB | ########## | 100%
2025-05-07T20:24:27.2524075Z _libgcc_mutex-0.1 | 3 KB | ########## | 100%
2025-05-07T20:24:27.2529926Z _openmp_mutex-5.1 | 21 KB | ########## | 100%
2025-05-07T20:24:27.2532389Z done
2025-05-07T20:24:27.4638625Z Preparing transaction: done
2025-05-07T20:24:28.8900758Z Verifying transaction: done
2025-05-07T20:24:31.3082615Z Executing transaction: done
2025-05-07T20:24:31.3584987Z #
2025-05-07T20:24:31.3585334Z # To activate this environment, use
2025-05-07T20:24:31.3585738Z #
2025-05-07T20:24:31.3586019Z # $ conda activate build_binary
2025-05-07T20:24:31.3586434Z #
2025-05-07T20:24:31.3586738Z # To deactivate an active environment, use
2025-05-07T20:24:31.3587146Z #
2025-05-07T20:24:31.3587405Z # $ conda deactivate
2025-05-07T20:24:31.4684639Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:31.4706335Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:34.3308876Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (25.1)
2025-05-07T20:24:34.3310423Z Collecting pip
2025-05-07T20:24:34.3310750Z Using cached pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:34.3311151Z Using cached pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:34.3311499Z Installing collected packages: pip
2025-05-07T20:24:34.3311791Z Attempting uninstall: pip
2025-05-07T20:24:34.3312061Z Found existing installation: pip 25.1
2025-05-07T20:24:34.3312368Z Uninstalling pip-25.1:
2025-05-07T20:24:34.3312665Z Successfully uninstalled pip-25.1
2025-05-07T20:24:34.3312974Z Successfully installed pip-25.1.1
2025-05-07T20:24:34.3953524Z [SETUP] Upgrading pyOpenSSL ...
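[NOTE] The step below passes the version spec pyOpenSSL>22.1.0. When typed into a shell directly, the ">" must be quoted or it is parsed as output redirection instead of a version constraint; a minimal sketch of the safe form, reusing the environment name from this log:

    # Quote the spec so ">" reaches conda instead of the shell.
    conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"
    # Verify the import the same way the log's [CHECK] step does:
    conda run -n build_binary python -c "import OpenSSL; print(OpenSSL.__version__)"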
2025-05-07T20:24:34.3975671Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0 2025-05-07T20:24:35.2519654Z Channels: 2025-05-07T20:24:35.2519910Z - conda-forge 2025-05-07T20:24:35.2520144Z Platform: linux-64 2025-05-07T20:24:45.8507389Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - done 2025-05-07T20:24:47.5474263Z Solving environment: | / - \ | / done 2025-05-07T20:24:47.6117874Z 2025-05-07T20:24:47.6118336Z ## Package Plan ## 2025-05-07T20:24:47.6118567Z 2025-05-07T20:24:47.6118855Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:47.6119276Z 2025-05-07T20:24:47.6119403Z added / updated specs: 2025-05-07T20:24:47.6120217Z - pyopenssl[version='>22.1.0'] 2025-05-07T20:24:47.6120443Z 2025-05-07T20:24:47.6120448Z 2025-05-07T20:24:47.6120584Z The following packages will be downloaded: 2025-05-07T20:24:47.6120793Z 2025-05-07T20:24:47.6120909Z package | build 2025-05-07T20:24:47.6121226Z ---------------------------|----------------- 2025-05-07T20:24:47.6121589Z cffi-1.17.1 | py313hfab6e84_0 289 KB conda-forge 2025-05-07T20:24:47.6122025Z cryptography-44.0.3 | py313h6556f6e_0 1.5 MB conda-forge 2025-05-07T20:24:47.6122566Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge 2025-05-07T20:24:47.6123073Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:24:47.6123483Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge 2025-05-07T20:24:47.6123877Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge 2025-05-07T20:24:47.6124313Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge 2025-05-07T20:24:47.6124931Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge 2025-05-07T20:24:47.6125402Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge 2025-05-07T20:24:47.6125875Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge 2025-05-07T20:24:47.6126282Z ------------------------------------------------------------ 2025-05-07T20:24:47.6126623Z Total: 6.4 MB 2025-05-07T20:24:47.6126838Z 2025-05-07T20:24:47.6126964Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:47.6127178Z 2025-05-07T20:24:47.6127380Z cffi conda-forge/linux-64::cffi-1.17.1-py313hfab6e84_0 2025-05-07T20:24:47.6127862Z cryptography conda-forge/linux-64::cryptography-44.0.3-py313h6556f6e_0 2025-05-07T20:24:47.6128349Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2 2025-05-07T20:24:47.6129066Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1 2025-05-07T20:24:47.6129699Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0 2025-05-07T20:24:47.6130444Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0 2025-05-07T20:24:47.6131306Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0 2025-05-07T20:24:47.6131821Z 2025-05-07T20:24:47.6131995Z The following packages will be UPDATED: 2025-05-07T20:24:47.6132328Z 2025-05-07T20:24:47.6132937Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0 2025-05-07T20:24:47.6134192Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2 2025-05-07T20:24:47.6135120Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2 2025-05-07T20:24:47.6135806Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1 2025-05-07T20:24:47.6136173Z 2025-05-07T20:24:47.6136182Z 2025-05-07T20:24:47.6136186Z 2025-05-07T20:24:47.6136335Z Downloading and Extracting 
Packages: ...working...
2025-05-07T20:24:47.8383394Z libgomp-15.1.0 | 442 KB | ########## | 100%
2025-05-07T20:24:47.8855517Z libgcc-15.1.0 | 810 KB | ########## | 100%
2025-05-07T20:24:47.9564736Z pyopenssl-25.0.0 | 120 KB | ########## | 100%
2025-05-07T20:24:47.9582598Z cffi-1.17.1 | 289 KB | ########## | 100%
2025-05-07T20:24:47.9738616Z pycparser-2.22 | 108 KB | ########## | 100%
2025-05-07T20:24:48.0070127Z typing_extensions-4. | 51 KB | ########## | 100%
2025-05-07T20:24:48.0274177Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:24:48.0673876Z typing-extensions-4. | 88 KB | ########## | 100%
2025-05-07T20:24:48.1020547Z cryptography-44.0.3 | 1.5 MB | ########## | 100%
2025-05-07T20:24:48.1026829Z openssl-3.5.0 | 3.0 MB | ########## | 100%
2025-05-07T20:24:48.1031201Z done
2025-05-07T20:24:48.2041730Z Preparing transaction: done
2025-05-07T20:24:48.3046957Z Verifying transaction: done
2025-05-07T20:24:49.8071688Z Executing transaction: done
2025-05-07T20:24:49.9862751Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:51.7208595Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:51.7221885Z [SETUP] Installing libxcrypt ...
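[NOTE] The libxcrypt step below exists because modern glibc no longer ships libcrypt development files, while some Python extension builds still expect crypt.h to be reachable from the Python headers; the log resolves this by installing libxcrypt from conda-forge and copying its header into the interpreter's include directory. A sketch of the same fix, with paths taken from the log:

    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    ENV_PREFIX="$HOME/miniconda/envs/build_binary"
    # Make crypt.h visible to builds that include it via the Python headers.
    cp "$ENV_PREFIX/include/crypt.h" "$ENV_PREFIX/include/python3.13/crypt.h"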
2025-05-07T20:24:51.7244614Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt 2025-05-07T20:24:52.5903032Z Channels: 2025-05-07T20:24:52.5903357Z - conda-forge 2025-05-07T20:24:52.5903661Z Platform: linux-64 2025-05-07T20:24:55.8913022Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:24:56.2592939Z Solving environment: \ done 2025-05-07T20:24:56.3216640Z 2025-05-07T20:24:56.3217185Z ## Package Plan ## 2025-05-07T20:24:56.3217392Z 2025-05-07T20:24:56.3217607Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:56.3217904Z 2025-05-07T20:24:56.3217999Z added / updated specs: 2025-05-07T20:24:56.3218248Z - libxcrypt 2025-05-07T20:24:56.3218381Z 2025-05-07T20:24:56.3218386Z 2025-05-07T20:24:56.3218510Z The following packages will be downloaded: 2025-05-07T20:24:56.3218724Z 2025-05-07T20:24:56.3218847Z package | build 2025-05-07T20:24:56.3219161Z ---------------------------|----------------- 2025-05-07T20:24:56.3219537Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge 2025-05-07T20:24:56.3219940Z ------------------------------------------------------------ 2025-05-07T20:24:56.3220302Z Total: 98 KB 2025-05-07T20:24:56.3220521Z 2025-05-07T20:24:56.3220656Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:56.3220878Z 2025-05-07T20:24:56.3221096Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1 2025-05-07T20:24:56.3221376Z 2025-05-07T20:24:56.3221380Z 2025-05-07T20:24:56.3221384Z 2025-05-07T20:24:56.3221537Z Downloading and Extracting Packages: ...working... 2025-05-07T20:24:56.4852687Z libxcrypt-4.4.36 | 98 KB | | 0% 2025-05-07T20:24:56.4879501Z libxcrypt-4.4.36 | 98 KB | #6 | 16% 2025-05-07T20:24:56.4982215Z libxcrypt-4.4.36 | 98 KB | ########## | 100% 2025-05-07T20:24:56.4984722Z libxcrypt-4.4.36 | 98 KB | ########## | 100% 2025-05-07T20:24:56.4985198Z 2025-05-07T20:24:56.4985493Z done 2025-05-07T20:24:56.5988550Z Preparing transaction: / done 2025-05-07T20:24:56.6993963Z Verifying transaction: \ done 2025-05-07T20:24:56.8001140Z Executing transaction: / done 2025-05-07T20:25:00.2593053Z [SETUP] Copying over ... 2025-05-07T20:25:00.2593762Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.13/crypt.h 2025-05-07T20:25:00.2594290Z 2025-05-07T20:25:00.2623556Z 2025-05-07T20:25:01.9166359Z [SETUP] Installed Python version: Python 3.13.2 2025-05-07T20:25:01.9167492Z [SETUP] Successfully created Conda environment: build_binary 2025-05-07T20:25:01.9199121Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:25:01.9199726Z . 
$PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:25:01.9212416Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:01.9212763Z env: 2025-05-07T20:25:01.9212982Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:01.9213277Z BUILD_ENV: build_binary 2025-05-07T20:25:01.9213536Z BUILD_TARGET: genai 2025-05-07T20:25:01.9214072Z BUILD_VARIANT: cuda 2025-05-07T20:25:01.9214295Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:25:01.9214547Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:01.9214844Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:01.9215163Z ##[endgroup] 2025-05-07T20:25:02.2607434Z ################################################################################ 2025-05-07T20:25:02.2607920Z # Install C/C++ Compilers 2025-05-07T20:25:02.2608189Z # 2025-05-07T20:25:02.2624631Z # [2025-05-07T20:25:02.262Z] + install_cxx_compiler build_binary gcc 2025-05-07T20:25:02.2625164Z ################################################################################ 2025-05-07T20:25:02.2625465Z 2025-05-07T20:25:02.2642113Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:02.3567841Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:02.3578553Z [INSTALL] Installing GLIBC (architecture = 64) ... 2025-05-07T20:25:02.3601036Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17 2025-05-07T20:25:03.2238371Z Channels: 2025-05-07T20:25:03.2238664Z - conda-forge 2025-05-07T20:25:03.2238924Z Platform: linux-64 2025-05-07T20:25:06.5706005Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:25:06.9410085Z Solving environment: \ done 2025-05-07T20:25:07.0040216Z 2025-05-07T20:25:07.0040735Z ## Package Plan ## 2025-05-07T20:25:07.0040910Z 2025-05-07T20:25:07.0041147Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:25:07.0041463Z 2025-05-07T20:25:07.0041562Z added / updated specs: 2025-05-07T20:25:07.0041823Z - sysroot_linux-64=2.17 2025-05-07T20:25:07.0041989Z 2025-05-07T20:25:07.0041993Z 2025-05-07T20:25:07.0042120Z The following packages will be downloaded: 2025-05-07T20:25:07.0042330Z 2025-05-07T20:25:07.0042446Z package | build 2025-05-07T20:25:07.0042764Z ---------------------------|----------------- 2025-05-07T20:25:07.0043189Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge 2025-05-07T20:25:07.0043660Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge 2025-05-07T20:25:07.0044060Z ------------------------------------------------------------ 2025-05-07T20:25:07.0044393Z Total: 15.4 MB 2025-05-07T20:25:07.0044595Z 2025-05-07T20:25:07.0044722Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:07.0044946Z 2025-05-07T20:25:07.0045221Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18 2025-05-07T20:25:07.0045767Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18 2025-05-07T20:25:07.0046070Z 2025-05-07T20:25:07.0046074Z 2025-05-07T20:25:07.0046078Z 2025-05-07T20:25:07.0046218Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:25:07.9056448Z kernel-headers_linux | 921 KB | ########## | 100%
2025-05-07T20:25:07.9060201Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:25:07.9061529Z done
2025-05-07T20:25:08.0065362Z Preparing transaction: done
2025-05-07T20:25:08.2071640Z Verifying transaction: done
2025-05-07T20:25:08.4131059Z Executing transaction: done
2025-05-07T20:25:08.5705281Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:08.5705707Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:10.2769666Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:25:10.2782856Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:10.2805278Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:11.1688221Z Channels:
2025-05-07T20:25:11.1688481Z - conda-forge
2025-05-07T20:25:11.1688720Z Platform: linux-64
2025-05-07T20:25:14.4785684Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:15.4437234Z Solving environment: done
2025-05-07T20:25:15.5083247Z ## Package Plan ##
2025-05-07T20:25:15.5083634Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:15.5084058Z added / updated specs:
2025-05-07T20:25:15.5084313Z - gxx_linux-64=11.4.0
2025-05-07T20:25:15.5084594Z The following packages will be downloaded:
2025-05-07T20:25:15.5084948Z package | build
2025-05-07T20:25:15.5085255Z ---------------------------|-----------------
2025-05-07T20:25:15.5085647Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:25:15.5086122Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:25:15.5086578Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:25:15.5087012Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:25:15.5087439Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:25:15.5087863Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:25:15.5088273Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:25:15.5088729Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:25:15.5089195Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:25:15.5089626Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge
2025-05-07T20:25:15.5090084Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge
2025-05-07T20:25:15.5090553Z libstdcxx-ng-15.1.0 |
h4852527_2 34 KB conda-forge 2025-05-07T20:25:15.5090956Z ------------------------------------------------------------ 2025-05-07T20:25:15.5091290Z Total: 91.6 MB 2025-05-07T20:25:15.5091509Z 2025-05-07T20:25:15.5091637Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:15.5091856Z 2025-05-07T20:25:15.5092121Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7 2025-05-07T20:25:15.5092662Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4 2025-05-07T20:25:15.5093538Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13 2025-05-07T20:25:15.5094168Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4 2025-05-07T20:25:15.5094655Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13 2025-05-07T20:25:15.5095145Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4 2025-05-07T20:25:15.5095655Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:25:15.5096364Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13 2025-05-07T20:25:15.5096850Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2 2025-05-07T20:25:15.5097375Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:25:15.5097724Z 2025-05-07T20:25:15.5097835Z The following packages will be UPDATED: 2025-05-07T20:25:15.5098041Z 2025-05-07T20:25:15.5098524Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7 2025-05-07T20:25:15.5099218Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2 2025-05-07T20:25:15.5099608Z 2025-05-07T20:25:15.5099612Z 2025-05-07T20:25:15.5099616Z 2025-05-07T20:25:15.5099760Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:25:15.8778142Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%
2025-05-07T20:25:16.0013786Z binutils_impl_linux- | 6.0 MB | ########## | 100%
2025-05-07T20:25:16.2077933Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%
2025-05-07T20:25:16.2464520Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%
2025-05-07T20:25:16.2890490Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:16.3362707Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%
2025-05-07T20:25:16.3466989Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%
2025-05-07T20:25:16.3785876Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%
2025-05-07T20:25:16.4057844Z binutils_linux-64-2. | 28 KB | ########## | 100%
2025-05-07T20:25:16.5223169Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:25:16.5470887Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:25:16.8572270Z gcc_impl_linux-64-11 | 53.0 MB | ########9 | 90%
2025-05-07T20:25:17.0085923Z binutils_linux-64-2.
| 28 KB | ########## | 100%  2025-05-07T20:25:17.0086238Z 2025-05-07T20:25:17.0086242Z 2025-05-07T20:25:17.0086246Z 2025-05-07T20:25:17.1869376Z binutils_impl_linux- | 6.0 MB | ########## | 100%  2025-05-07T20:25:17.1870157Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:17.2394437Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:17.2394705Z 2025-05-07T20:25:17.4903107Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%  2025-05-07T20:25:17.4903385Z 2025-05-07T20:25:17.4903390Z 2025-05-07T20:25:17.9128394Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%  2025-05-07T20:25:17.9135587Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:17.9135941Z 2025-05-07T20:25:17.9136160Z 2025-05-07T20:25:17.9136460Z  2025-05-07T20:25:17.9136723Z 2025-05-07T20:25:17.9136727Z 2025-05-07T20:25:17.9136905Z  2025-05-07T20:25:17.9137191Z 2025-05-07T20:25:17.9137197Z 2025-05-07T20:25:17.9137203Z 2025-05-07T20:25:17.9137391Z  2025-05-07T20:25:17.9137601Z 2025-05-07T20:25:17.9137617Z 2025-05-07T20:25:17.9137621Z 2025-05-07T20:25:17.9137625Z 2025-05-07T20:25:17.9137798Z  2025-05-07T20:25:17.9138010Z 2025-05-07T20:25:17.9138014Z 2025-05-07T20:25:17.9138018Z 2025-05-07T20:25:17.9138021Z 2025-05-07T20:25:17.9138025Z 2025-05-07T20:25:17.9138211Z  2025-05-07T20:25:17.9138425Z 2025-05-07T20:25:17.9138429Z 2025-05-07T20:25:17.9138433Z 2025-05-07T20:25:17.9138436Z 2025-05-07T20:25:17.9138684Z 2025-05-07T20:25:17.9138689Z 2025-05-07T20:25:17.9138874Z  2025-05-07T20:25:17.9139095Z 2025-05-07T20:25:17.9139099Z 2025-05-07T20:25:17.9139102Z 2025-05-07T20:25:17.9139106Z 2025-05-07T20:25:17.9139110Z 2025-05-07T20:25:17.9139113Z 2025-05-07T20:25:17.9139117Z 2025-05-07T20:25:17.9139298Z  2025-05-07T20:25:17.9139670Z 2025-05-07T20:25:17.9139673Z 2025-05-07T20:25:17.9139677Z 2025-05-07T20:25:17.9139680Z 2025-05-07T20:25:17.9139684Z 2025-05-07T20:25:17.9139687Z 2025-05-07T20:25:17.9139691Z 2025-05-07T20:25:17.9139694Z 2025-05-07T20:25:17.9139877Z  2025-05-07T20:25:17.9140102Z 2025-05-07T20:25:17.9140105Z 2025-05-07T20:25:17.9140109Z 2025-05-07T20:25:17.9140112Z 2025-05-07T20:25:17.9140116Z 2025-05-07T20:25:17.9140119Z 2025-05-07T20:25:17.9140130Z 2025-05-07T20:25:17.9140134Z 2025-05-07T20:25:17.9140138Z 2025-05-07T20:25:17.9140327Z  2025-05-07T20:25:17.9140540Z 2025-05-07T20:25:17.9140543Z 2025-05-07T20:25:17.9140547Z 2025-05-07T20:25:17.9140550Z 2025-05-07T20:25:17.9140554Z 2025-05-07T20:25:17.9140558Z 2025-05-07T20:25:17.9140561Z 2025-05-07T20:25:17.9140565Z 2025-05-07T20:25:17.9140568Z 2025-05-07T20:25:17.9140572Z 2025-05-07T20:25:17.9140775Z  2025-05-07T20:25:17.9140996Z 2025-05-07T20:25:17.9141000Z 2025-05-07T20:25:17.9141003Z 2025-05-07T20:25:17.9141007Z 2025-05-07T20:25:17.9141010Z 2025-05-07T20:25:17.9141014Z 2025-05-07T20:25:17.9141017Z 2025-05-07T20:25:17.9141021Z 2025-05-07T20:25:17.9141025Z 2025-05-07T20:25:17.9141028Z 2025-05-07T20:25:17.9141039Z 2025-05-07T20:25:17.9141247Z  done 2025-05-07T20:25:18.0142384Z Preparing transaction: \ done 2025-05-07T20:25:18.3150243Z Verifying transaction: / - \ done 2025-05-07T20:25:18.4160139Z Executing transaction: / done 2025-05-07T20:25:18.5810239Z [INSTALL] Setting the C/C++ compiler symlinks ... 
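[NOTE] The transaction above installs the conda-forge GCC 11.4.0 toolchain into the build_binary environment. The exact package spec is presumably set inside setup_env.bash; as a rough local equivalent, a minimal sketch (the pins shown here are an assumption inferred from the packages downloaded above):

    # Sketch only: gxx_linux-64/gcc_linux-64 pull in gcc_impl/gxx_impl, binutils,
    # and the libstdc++/libgcc devel packages listed in the download log above.
    conda install -n build_binary -c conda-forge -y \
        gcc_linux-64=11.4.0 gxx_linux-64=11.4.0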
2025-05-07T20:25:18.5810239Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:25:22.5070964Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:22.5102338Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:22.5132724Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:22.5160123Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:24.4196770Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:24.4824491Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:26.3806701Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:26.4444697Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:28.3393039Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:28.4027051Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:30.3048146Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:30.3674804Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:30.3679137Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:30.3679703Z + conda run -n build_binary cc -dM -E -
2025-05-07T20:25:32.2691302Z #define __gnu_linux__ 1
2025-05-07T20:25:32.2692869Z #define __GNUC__ 11
2025-05-07T20:25:32.2717461Z #define __LP64__ 1
2025-05-07T20:25:32.2723104Z #define __VERSION__ "11.4.0"
2025-05-07T20:25:32.2729856Z #define __ELF__ 1
2025-05-07T20:25:32.2734224Z #define __x86_64__ 1
2025-05-07T20:25:32.2742972Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:32.2752224Z #define __linux__ 1
2025-05-07T20:25:32.2761640Z #define __GNUC_PATCHLEVEL__ 0
2025-05-07T20:25:32.2773149Z #define __GNUC_MINOR__ 4
2025-05-07T20:25:32.2775705Z #define __STDC__ 1
[... remaining predefined macros elided ...]
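[NOTE] With -dM -E the compiler preprocesses an empty translation unit and prints every predefined macro, which makes one-line toolchain audits easy. A minimal sketch of that pattern (the grep selection is illustrative, not part of setup_env.bash):

    # Sketch: extract the toolchain facts that matter from the macro dump.
    conda run -n build_binary cc -dumpmachine    # target triple, e.g. x86_64-conda-linux-gnu
    conda run -n build_binary cc -dM -E - < /dev/null \
        | grep -E '__GNUC__|__GNUC_MINOR__|__STDC_VERSION__|__x86_64__'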
2025-05-07T20:25:32.3310998Z [INFO] Printing out all preprocessor defines in the C++ compiler ...
2025-05-07T20:25:32.3311453Z + conda run -n build_binary c++ -dM -E -x c++ -
2025-05-07T20:25:34.2353041Z #define __cpp_if_constexpr 201606L
2025-05-07T20:25:34.2378805Z #define __GNUC__ 11
2025-05-07T20:25:34.2379023Z #define __GXX_RTTI 1
2025-05-07T20:25:34.2386043Z #define __cplusplus 201703L
2025-05-07T20:25:34.2390180Z #define __GNUG__ 11
2025-05-07T20:25:34.2392834Z #define __GXX_ABI_VERSION 1016
2025-05-07T20:25:34.2397184Z #define __LP64__ 1
2025-05-07T20:25:34.2414877Z #define __x86_64__ 1
2025-05-07T20:25:34.2427213Z #define __EXCEPTIONS 1
2025-05-07T20:25:34.2428959Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16
2025-05-07T20:25:34.2435281Z #define __linux__ 1
2025-05-07T20:25:34.2444338Z #define __GNUC_PATCHLEVEL__ 0
2025-05-07T20:25:34.2447097Z #define __STDCPP_THREADS__ 1
2025-05-07T20:25:34.2455273Z #define __cpp_deduction_guides 201703L
2025-05-07T20:25:34.2458368Z #define __GNUC_MINOR__ 4
[... remaining predefined macros elided ...]
__cpp_generic_lambdas 201304L 2025-05-07T20:25:34.2464120Z #define __SSE_MATH__ 1 2025-05-07T20:25:34.2464357Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:34.2464634Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:25:34.2464931Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:34.2465207Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:34.2465498Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:34.2465757Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:34.2466048Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:34.2466432Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:34.2466793Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:34.2467084Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:34.2467374Z #define _GNU_SOURCE 1 2025-05-07T20:25:34.2467621Z #define __cpp_init_captures 201304L 2025-05-07T20:25:34.2467893Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:34.2468138Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:34.2468292Z 2025-05-07T20:25:34.2969801Z 2025-05-07T20:25:34.2970356Z + conda run -n build_binary c++ --version 2025-05-07T20:25:34.2970780Z 2025-05-07T20:25:36.1929834Z c++ (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:36.1930447Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:36.1931127Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:36.1931933Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:36.1932416Z 2025-05-07T20:25:36.1932423Z 2025-05-07T20:25:36.2569414Z 2025-05-07T20:25:36.2570653Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:36.2571739Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:36.2572350Z 2025-05-07T20:25:38.2220779Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:38.2222622Z 2025-05-07T20:25:38.2223098Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:38.2223654Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:38.2223957Z 2025-05-07T20:25:40.1879810Z #define __cplusplus 201703L 2025-05-07T20:25:40.1882221Z 2025-05-07T20:25:40.1883006Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:40.1930675Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:40.1931086Z . 
$PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:40.1943681Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:40.1944026Z env: 2025-05-07T20:25:40.1944256Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:40.1944552Z BUILD_ENV: build_binary 2025-05-07T20:25:40.1944801Z BUILD_TARGET: genai 2025-05-07T20:25:40.1945031Z BUILD_VARIANT: cuda 2025-05-07T20:25:40.1945262Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:25:40.1945525Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:40.1945828Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:40.1946156Z ##[endgroup] 2025-05-07T20:25:40.5332740Z ################################################################################ 2025-05-07T20:25:40.5333105Z # Install CUDA 2025-05-07T20:25:40.5333318Z # 2025-05-07T20:25:40.5349661Z # [2025-05-07T20:25:40.534Z] + install_cuda build_binary 12.6.3 2025-05-07T20:25:40.5350048Z ################################################################################ 2025-05-07T20:25:40.5350262Z 2025-05-07T20:25:40.5366566Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:40.6292208Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:40.6292595Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:40.6297682Z + conda clean --packages --tarball -y 2025-05-07T20:25:40.6297898Z 2025-05-07T20:25:41.3409510Z Will remove 29 (113.6 MB) tarball(s). 2025-05-07T20:25:41.3409867Z Will remove 6 (619 KB) package(s). 2025-05-07T20:25:41.4088118Z 2025-05-07T20:25:41.4096961Z + conda clean --all -y 2025-05-07T20:25:41.4097176Z 2025-05-07T20:25:42.0815092Z There are no unused tarball(s) to remove. 2025-05-07T20:25:42.0815420Z Will remove 1 index cache(s). 2025-05-07T20:25:42.0815789Z There are no unused package(s) to remove. 2025-05-07T20:25:42.0816219Z There are no tempfile(s) to remove. 2025-05-07T20:25:42.0816615Z There are no logfile(s) to remove. 2025-05-07T20:25:42.1450625Z 2025-05-07T20:25:42.1464203Z [INSTALL] Installing CUDA 12.6.3 ... 
2025-05-07T20:25:42.1489082Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3 2025-05-07T20:25:43.0597217Z Channels: 2025-05-07T20:25:43.0597709Z - conda-forge 2025-05-07T20:25:43.0598538Z Platform: linux-64 2025-05-07T20:25:53.6575149Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - done 2025-05-07T20:25:54.7668159Z Solving environment: | / - \ | done 2025-05-07T20:25:54.8423350Z 2025-05-07T20:25:54.8423680Z ## Package Plan ## 2025-05-07T20:25:54.8423840Z 2025-05-07T20:25:54.8424189Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:25:54.8424532Z 2025-05-07T20:25:54.8424633Z added / updated specs: 2025-05-07T20:25:54.8424883Z - cuda=12.6.3 2025-05-07T20:25:54.8425020Z 2025-05-07T20:25:54.8425054Z 2025-05-07T20:25:54.8425186Z The following packages will be downloaded: 2025-05-07T20:25:54.8425399Z 2025-05-07T20:25:54.8425519Z package | build 2025-05-07T20:25:54.8425830Z ---------------------------|----------------- 2025-05-07T20:25:54.8426197Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:25:54.8426602Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:25:54.8427001Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:25:54.8427522Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:25:54.8428112Z cuda-12.6.3 | ha804496_0 26 KB conda-forge 2025-05-07T20:25:54.8428623Z cuda-cccl_linux-64-12.6.77 | ha770c72_0 1.0 MB conda-forge 2025-05-07T20:25:54.8429300Z cuda-command-line-tools-12.6.3| ha770c72_0 20 KB conda-forge 2025-05-07T20:25:54.8430785Z cuda-compiler-12.6.3 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:25:54.8431358Z cuda-crt-dev_linux-64-12.6.85| ha770c72_0 87 KB conda-forge 2025-05-07T20:25:54.8431991Z cuda-crt-tools-12.6.85 | ha770c72_0 26 KB conda-forge 2025-05-07T20:25:54.8432606Z cuda-cudart-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:54.8433226Z cuda-cudart-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:54.8433873Z cuda-cudart-dev_linux-64-12.6.77| h3f2d84a_0 357 KB conda-forge 2025-05-07T20:25:54.8434548Z cuda-cudart-static-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:54.8435152Z cuda-cudart-static_linux-64-12.6.77| h3f2d84a_0 744 KB conda-forge 2025-05-07T20:25:54.8435662Z cuda-cudart_linux-64-12.6.77| h3f2d84a_0 184 KB conda-forge 2025-05-07T20:25:54.8436143Z cuda-cuobjdump-12.6.77 | hbd13f7d_1 241 KB conda-forge 2025-05-07T20:25:54.8436586Z cuda-cupti-12.6.80 | hbd13f7d_0 1.9 MB conda-forge 2025-05-07T20:25:54.8437032Z cuda-cupti-dev-12.6.80 | h5888daf_0 3.4 MB conda-forge 2025-05-07T20:25:54.8437486Z cuda-cuxxfilt-12.6.77 | hbd13f7d_1 211 KB conda-forge 2025-05-07T20:25:54.8437943Z cuda-driver-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:54.8438421Z cuda-driver-dev_linux-64-12.6.77| h3f2d84a_0 35 KB conda-forge 2025-05-07T20:25:54.8438877Z cuda-gdb-12.6.77 | h50b4baa_1 370 KB conda-forge 2025-05-07T20:25:54.8439308Z cuda-libraries-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:54.8439770Z cuda-libraries-dev-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:54.8440235Z cuda-nsight-12.6.77 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:25:54.8440670Z cuda-nvcc-12.6.85 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:25:54.8441124Z cuda-nvcc-dev_linux-64-12.6.85| he91c749_0 10.8 MB conda-forge 2025-05-07T20:25:54.8441583Z cuda-nvcc-impl-12.6.85 | h85509e4_0 25 KB conda-forge 2025-05-07T20:25:54.8442037Z cuda-nvcc-tools-12.6.85 | he02047a_0 23.0 MB 
conda-forge 2025-05-07T20:25:54.8442502Z cuda-nvcc_linux-64-12.6.85 | h04802cd_0 25 KB conda-forge 2025-05-07T20:25:54.8442950Z cuda-nvdisasm-12.6.77 | hbd13f7d_1 47.6 MB conda-forge 2025-05-07T20:25:54.8443399Z cuda-nvml-dev-12.6.77 | hbd13f7d_1 159 KB conda-forge 2025-05-07T20:25:54.8443842Z cuda-nvprof-12.6.80 | hbd13f7d_0 2.6 MB conda-forge 2025-05-07T20:25:54.8444282Z cuda-nvprune-12.6.77 | hbd13f7d_1 66 KB conda-forge 2025-05-07T20:25:54.8444719Z cuda-nvrtc-12.6.85 | hbd13f7d_0 17.3 MB conda-forge 2025-05-07T20:25:54.8445163Z cuda-nvrtc-dev-12.6.85 | h5888daf_0 31 KB conda-forge 2025-05-07T20:25:54.8445604Z cuda-nvtx-12.6.77 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:25:54.8446050Z cuda-nvvm-dev_linux-64-12.6.85| ha770c72_0 25 KB conda-forge 2025-05-07T20:25:54.8446516Z cuda-nvvm-impl-12.6.85 | he02047a_0 7.7 MB conda-forge 2025-05-07T20:25:54.8446972Z cuda-nvvm-tools-12.6.85 | he02047a_0 10.4 MB conda-forge 2025-05-07T20:25:54.8447416Z cuda-nvvp-12.6.80 | hbd13f7d_1 109.3 MB conda-forge 2025-05-07T20:25:54.8447844Z cuda-opencl-12.6.77 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:25:54.8448293Z cuda-opencl-dev-12.6.77 | h5888daf_0 93 KB conda-forge 2025-05-07T20:25:54.8448767Z cuda-profiler-api-12.6.77 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:25:54.8449459Z cuda-runtime-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:25:54.8449916Z cuda-sanitizer-api-12.6.77 | hbd13f7d_1 8.9 MB conda-forge 2025-05-07T20:25:54.8450379Z cuda-toolkit-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:25:54.8450809Z cuda-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:54.8451229Z cuda-version-12.6 | h7480c83_3 20 KB conda-forge 2025-05-07T20:25:54.8451683Z cuda-visual-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:54.8452136Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:25:54.8452543Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:25:54.8452921Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:25:54.8453385Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:25:54.8454062Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:25:54.8454563Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:25:54.8455053Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:25:54.8455506Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:25:54.8455955Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:25:54.8456421Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:25:54.8456851Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:25:54.8457245Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:54.8457636Z gds-tools-1.11.1.6 | h5888daf_4 37.8 MB conda-forge 2025-05-07T20:25:54.8458043Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:25:54.8458411Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:54.8458789Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:25:54.8459180Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:25:54.8459565Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:25:54.8459973Z libcublas-12.6.4.1 | h5888daf_1 256.2 MB conda-forge 2025-05-07T20:25:54.8460410Z libcublas-dev-12.6.4.1 | h5888daf_1 88 KB conda-forge 2025-05-07T20:25:54.8460846Z libcufft-11.3.0.4 | hbd13f7d_0 156.2 MB conda-forge 2025-05-07T20:25:54.8461283Z libcufft-dev-11.3.0.4 | h5888daf_0 33 KB conda-forge 2025-05-07T20:25:54.8461720Z libcufile-1.11.1.6 | h12f29b5_4 900 KB conda-forge 
2025-05-07T20:25:54.8462166Z libcufile-dev-1.11.1.6 | h5888daf_4 35 KB conda-forge 2025-05-07T20:25:54.8462607Z libcurand-10.3.7.77 | hbd13f7d_0 39.9 MB conda-forge 2025-05-07T20:25:54.8463052Z libcurand-dev-10.3.7.77 | h5888daf_0 262 KB conda-forge 2025-05-07T20:25:54.8463494Z libcusolver-11.7.1.2 | h5888daf_1 95.8 MB conda-forge 2025-05-07T20:25:54.8463947Z libcusolver-dev-11.7.1.2 | h5888daf_1 59 KB conda-forge 2025-05-07T20:25:54.8464399Z libcusparse-12.5.4.2 | hbd13f7d_0 118.6 MB conda-forge 2025-05-07T20:25:54.8464842Z libcusparse-dev-12.5.4.2 | h5888daf_0 51 KB conda-forge 2025-05-07T20:25:54.8465309Z libedit-3.1.20250104 | pl5321h7949ede_0 132 KB conda-forge 2025-05-07T20:25:54.8465745Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:25:54.8466275Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:25:54.8466787Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:25:54.8467234Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:25:54.8467662Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:25:54.8468083Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:25:54.8468498Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:25:54.8468893Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:25:54.8469288Z libnpp-12.3.1.54 | h5888daf_0 93.4 MB conda-forge 2025-05-07T20:25:54.8469704Z libnpp-dev-12.3.1.54 | h5888daf_0 441 KB conda-forge 2025-05-07T20:25:54.8470128Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:25:54.8470557Z libnvfatbin-12.6.77 | hbd13f7d_0 783 KB conda-forge 2025-05-07T20:25:54.8471006Z libnvfatbin-dev-12.6.77 | h5888daf_0 26 KB conda-forge 2025-05-07T20:25:54.8471459Z libnvjitlink-12.6.85 | hbd13f7d_0 14.9 MB conda-forge 2025-05-07T20:25:54.8471924Z libnvjitlink-dev-12.6.85 | h5888daf_0 25 KB conda-forge 2025-05-07T20:25:54.8472377Z libnvjpeg-12.3.3.54 | h5888daf_0 2.4 MB conda-forge 2025-05-07T20:25:54.8472815Z libnvjpeg-dev-12.3.3.54 | ha770c72_0 31 KB conda-forge 2025-05-07T20:25:54.8473241Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:25:54.8473655Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:25:54.8474085Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:25:54.8474507Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:25:54.8474924Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:25:54.8475323Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:25:54.8475737Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:25:54.8476173Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:25:54.8476590Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:25:54.8476990Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:25:54.8477375Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:25:54.8477765Z ncurses-6.5 | h2d0b736_3 871 KB conda-forge 2025-05-07T20:25:54.8478212Z nsight-compute-2024.3.2.3 | hb5ebaad_0 443.1 MB conda-forge 2025-05-07T20:25:54.8478650Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:25:54.8479023Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:25:54.8479417Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:25:54.8479857Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:25:54.8480280Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:25:54.8480701Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:25:54.8481142Z python-3.13.0 |h9ebbce0_101_cp313 31.5 
MB conda-forge 2025-05-07T20:25:54.8481559Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:25:54.8481962Z sqlite-3.49.2 | h9eae976_0 840 KB conda-forge 2025-05-07T20:25:54.8482450Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:25:54.8482969Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:25:54.8483367Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:25:54.8483847Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:25:54.8484298Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:25:54.8484743Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:25:54.8485211Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:25:54.8485663Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:25:54.8486105Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:25:54.8486546Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:25:54.8486989Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:25:54.8487413Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:25:54.8487837Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:25:54.8488289Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:25:54.8488757Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:25:54.8489207Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:54.8489642Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:25:54.8490074Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:54.8490502Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:25:54.8490941Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:25:54.8491391Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:25:54.8491837Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:25:54.8492243Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:25:54.8492619Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:25:54.8492985Z ------------------------------------------------------------ 2025-05-07T20:25:54.8493321Z Total: 1.64 GB 2025-05-07T20:25:54.8493528Z 2025-05-07T20:25:54.8493782Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:54.8494004Z 2025-05-07T20:25:54.8494204Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:25:54.8494623Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:25:54.8495038Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:25:54.8495498Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:25:54.8495919Z cuda conda-forge/noarch::cuda-12.6.3-ha804496_0 2025-05-07T20:25:54.8496390Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.6.77-ha770c72_0 2025-05-07T20:25:54.8496975Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.6.3-ha770c72_0 2025-05-07T20:25:54.8497549Z cuda-compiler conda-forge/noarch::cuda-compiler-12.6.3-hbad6d8a_0 2025-05-07T20:25:54.8498087Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:54.8499077Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.6.85-ha770c72_0 2025-05-07T20:25:54.8499593Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.6.77-h5888daf_0 2025-05-07T20:25:54.8500111Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.6.77-h5888daf_0 2025-05-07T20:25:54.8501140Z cuda-cudart-dev_l~ 
conda-forge/noarch::cuda-cudart-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:54.8501739Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.6.77-h5888daf_0 2025-05-07T20:25:54.8502385Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:54.8503018Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:54.8503581Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1 2025-05-07T20:25:54.8504088Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0 2025-05-07T20:25:54.8504582Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0 2025-05-07T20:25:54.8505106Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1 2025-05-07T20:25:54.8505644Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0 2025-05-07T20:25:54.8506215Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:54.8506734Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1 2025-05-07T20:25:54.8507217Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0 2025-05-07T20:25:54.8507766Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0 2025-05-07T20:25:54.8508299Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0 2025-05-07T20:25:54.8508765Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0 2025-05-07T20:25:54.8509281Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0 2025-05-07T20:25:54.8509828Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0 2025-05-07T20:25:54.8510357Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0 2025-05-07T20:25:54.8510894Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0 2025-05-07T20:25:54.8511432Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1 2025-05-07T20:25:54.8511945Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1 2025-05-07T20:25:54.8512437Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0 2025-05-07T20:25:54.8512933Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1 2025-05-07T20:25:54.8513424Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0 2025-05-07T20:25:54.8513926Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0 2025-05-07T20:25:54.8514406Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0 2025-05-07T20:25:54.8514913Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:54.8515461Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0 2025-05-07T20:25:54.8515997Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0 2025-05-07T20:25:54.8516550Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1 2025-05-07T20:25:54.8517222Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0 2025-05-07T20:25:54.8517921Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0 2025-05-07T20:25:54.8518534Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0 2025-05-07T20:25:54.8519057Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0 2025-05-07T20:25:54.8519595Z cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1 2025-05-07T20:25:54.8520137Z cuda-toolkit 
conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0 2025-05-07T20:25:54.8520628Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0 2025-05-07T20:25:54.8521272Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3 2025-05-07T20:25:54.8522114Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0 2025-05-07T20:25:54.8522740Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:25:54.8523185Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:25:54.8523679Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:25:54.8524271Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:25:54.8524864Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:25:54.8525417Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:25:54.8525909Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:25:54.8526398Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:25:54.8526879Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:25:54.8527333Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:25:54.8527756Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:25:54.8528175Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4 2025-05-07T20:25:54.8528647Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:25:54.8529169Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:25:54.8529730Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:25:54.8530209Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:25:54.8530608Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:25:54.8531040Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1 2025-05-07T20:25:54.8531548Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1 2025-05-07T20:25:54.8532047Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0 2025-05-07T20:25:54.8532522Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0 2025-05-07T20:25:54.8533006Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4 2025-05-07T20:25:54.8533499Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4 2025-05-07T20:25:54.8534129Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0 2025-05-07T20:25:54.8534623Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0 2025-05-07T20:25:54.8535131Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1 2025-05-07T20:25:54.8535655Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1 2025-05-07T20:25:54.8536180Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0 2025-05-07T20:25:54.8536699Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0 2025-05-07T20:25:54.8537230Z libedit conda-forge/linux-64::libedit-3.1.20250104-pl5321h7949ede_0 2025-05-07T20:25:54.8537710Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:25:54.8538171Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:25:54.8538656Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:25:54.8539163Z libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:25:54.8539638Z libglib 
conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:25:54.8540095Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:25:54.8540553Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:25:54.8540976Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:25:54.8541395Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0 2025-05-07T20:25:54.8541967Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0 2025-05-07T20:25:54.8542502Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:25:54.8542963Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0 2025-05-07T20:25:54.8543483Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0 2025-05-07T20:25:54.8544002Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0 2025-05-07T20:25:54.8544544Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0 2025-05-07T20:25:54.8545060Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0 2025-05-07T20:25:54.8545575Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0 2025-05-07T20:25:54.8546048Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:25:54.8546487Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:25:54.8546960Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:25:54.8547424Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:25:54.8547841Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:25:54.8548296Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:25:54.8548778Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:25:54.8549228Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:25:54.8549646Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:54.8558945Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:25:54.8559681Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0 2025-05-07T20:25:54.8560342Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:25:54.8560732Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:25:54.8561148Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:25:54.8561650Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:25:54.8562140Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:25:54.8562598Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:25:54.8563139Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:25:54.8563581Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:25:54.8564018Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:25:54.8564495Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:25:54.8565024Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:25:54.8565563Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:25:54.8566142Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:25:54.8566669Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0 2025-05-07T20:25:54.8567179Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0 2025-05-07T20:25:54.8567698Z xorg-libice 
conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0 2025-05-07T20:25:54.8568177Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0 2025-05-07T20:25:54.8568636Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0 2025-05-07T20:25:54.8569116Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0 2025-05-07T20:25:54.8569656Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2 2025-05-07T20:25:54.8570227Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0 2025-05-07T20:25:54.8570920Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0 2025-05-07T20:25:54.8571515Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0 2025-05-07T20:25:54.8572030Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0 2025-05-07T20:25:54.8572521Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0 2025-05-07T20:25:54.8573015Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0 2025-05-07T20:25:54.8573554Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0 2025-05-07T20:25:54.8574213Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3 2025-05-07T20:25:54.8574918Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2 2025-05-07T20:25:54.8575170Z 2025-05-07T20:25:54.8575285Z The following packages will be UPDATED: 2025-05-07T20:25:54.8575490Z 2025-05-07T20:25:54.8575773Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0 2025-05-07T20:25:54.8576393Z ncurses pkgs/main::ncurses-6.4-h6a678d5_0 --> conda-forge::ncurses-6.5-h2d0b736_3 2025-05-07T20:25:54.8576986Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.49.2-h9eae976_0 2025-05-07T20:25:54.8577559Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:54.8577887Z 2025-05-07T20:25:54.8578102Z The following packages will be SUPERSEDED by a higher-priority channel: 2025-05-07T20:25:54.8578408Z 2025-05-07T20:25:54.8578648Z expat pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0 2025-05-07T20:25:54.8579326Z python pkgs/main::python-3.13.2-hf623796_100~ --> conda-forge::python-3.13.0-h9ebbce0_101_cp313 2025-05-07T20:25:54.8580142Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101 2025-05-07T20:25:54.8580521Z 2025-05-07T20:25:54.8580555Z 2025-05-07T20:25:54.8580559Z 2025-05-07T20:25:54.8580706Z Downloading and Extracting Packages: ...working... 2025-05-07T20:25:54.8581089Z nsight-compute-2024. 
| 443.1 MB | | 0%
[... download and extraction progress bars for the packages above elided ...]
| 443.1 MB | ###8 | 39% 2025-05-07T20:25:59.4840401Z 2025-05-07T20:25:59.5386334Z libcublas-12.6.4.1 | 256.2 MB | ######3 | 63%  2025-05-07T20:25:59.5876045Z nsight-compute-2024. | 443.1 MB | ###9 | 40% 2025-05-07T20:25:59.5876431Z 2025-05-07T20:25:59.6392875Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 65%  2025-05-07T20:25:59.6878019Z nsight-compute-2024. | 443.1 MB | #### | 41% 2025-05-07T20:25:59.6882967Z 2025-05-07T20:25:59.7396020Z libcublas-12.6.4.1 | 256.2 MB | ######6 | 67%  2025-05-07T20:25:59.7880465Z nsight-compute-2024. | 443.1 MB | ####1 | 42% 2025-05-07T20:25:59.7883026Z 2025-05-07T20:25:59.8398876Z libcublas-12.6.4.1 | 256.2 MB | ######8 | 69%  2025-05-07T20:25:59.8881861Z nsight-compute-2024. | 443.1 MB | ####2 | 43% 2025-05-07T20:25:59.8884531Z 2025-05-07T20:25:59.9416055Z libcublas-12.6.4.1 | 256.2 MB | ####### | 71%  2025-05-07T20:25:59.9882425Z nsight-compute-2024. | 443.1 MB | ####3 | 44% 2025-05-07T20:25:59.9884040Z 2025-05-07T20:26:00.0448591Z libcublas-12.6.4.1 | 256.2 MB | #######3 | 73%  2025-05-07T20:26:00.0448856Z 2025-05-07T20:26:00.0448860Z 2025-05-07T20:26:00.0448874Z 2025-05-07T20:26:00.0448878Z 2025-05-07T20:26:00.0896739Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:26:00.0897101Z 2025-05-07T20:26:00.1018290Z libcublas-12.6.4.1 | 256.2 MB | #######5 | 75%  2025-05-07T20:26:00.1018732Z 2025-05-07T20:26:00.1018736Z 2025-05-07T20:26:00.1018740Z 2025-05-07T20:26:00.1018743Z 2025-05-07T20:26:00.1019240Z 2025-05-07T20:26:00.1263893Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:26:00.2020478Z nsight-compute-2024. | 443.1 MB | ####5 | 45% 2025-05-07T20:26:00.2020755Z 2025-05-07T20:26:00.2020759Z 2025-05-07T20:26:00.2020763Z 2025-05-07T20:26:00.2020766Z 2025-05-07T20:26:00.2021960Z 2025-05-07T20:26:00.2245232Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 3%  2025-05-07T20:26:00.2247475Z 2025-05-07T20:26:00.2330781Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 77%  2025-05-07T20:26:00.3024539Z nsight-compute-2024. | 443.1 MB | ####5 | 46% 2025-05-07T20:26:00.3024980Z 2025-05-07T20:26:00.3024986Z 2025-05-07T20:26:00.3024992Z 2025-05-07T20:26:00.3024998Z 2025-05-07T20:26:00.3025003Z 2025-05-07T20:26:00.3047365Z cuda-nvvp-12.6.80 | 109.3 MB | 6 | 7%  2025-05-07T20:26:00.3047798Z 2025-05-07T20:26:00.3047804Z 2025-05-07T20:26:00.3050802Z 2025-05-07T20:26:00.3459197Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:26:00.3653821Z nsight-compute-2024. | 443.1 MB | ####6 | 47% 2025-05-07T20:26:00.3659184Z 2025-05-07T20:26:00.3666225Z libcublas-12.6.4.1 | 256.2 MB | #######8 | 79%  2025-05-07T20:26:00.3666488Z 2025-05-07T20:26:00.3666492Z 2025-05-07T20:26:00.3666495Z 2025-05-07T20:26:00.3666499Z 2025-05-07T20:26:00.3666503Z 2025-05-07T20:26:00.3666748Z 2025-05-07T20:26:00.4026207Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:26:00.4026506Z 2025-05-07T20:26:00.4026510Z 2025-05-07T20:26:00.4026514Z 2025-05-07T20:26:00.4026518Z 2025-05-07T20:26:00.4026522Z 2025-05-07T20:26:00.4658917Z cuda-nvvp-12.6.80 | 109.3 MB | 9 | 10%  2025-05-07T20:26:00.4659208Z 2025-05-07T20:26:00.4659213Z 2025-05-07T20:26:00.4659216Z 2025-05-07T20:26:00.4659255Z 2025-05-07T20:26:00.4659259Z 2025-05-07T20:26:00.4661009Z 2025-05-07T20:26:00.4760897Z libcusolver-11.7.1.2 | 95.8 MB | 3 | 3%  2025-05-07T20:26:00.5035718Z nsight-compute-2024. 
| 443.1 MB | ####7 | 48% 2025-05-07T20:26:00.5035977Z 2025-05-07T20:26:00.5035981Z 2025-05-07T20:26:00.5035985Z 2025-05-07T20:26:00.5035989Z 2025-05-07T20:26:00.5035993Z 2025-05-07T20:26:00.5088969Z cuda-nvvp-12.6.80 | 109.3 MB | #2 | 12%  2025-05-07T20:26:00.5094436Z 2025-05-07T20:26:00.5659308Z libcublas-12.6.4.1 | 256.2 MB | ######## | 81%  2025-05-07T20:26:00.5659590Z 2025-05-07T20:26:00.5659594Z 2025-05-07T20:26:00.5659597Z 2025-05-07T20:26:00.5659602Z 2025-05-07T20:26:00.5659605Z 2025-05-07T20:26:00.5660337Z 2025-05-07T20:26:00.5885444Z libcusolver-11.7.1.2 | 95.8 MB | 6 | 6%  2025-05-07T20:26:00.6040262Z nsight-compute-2024. | 443.1 MB | ####8 | 48% 2025-05-07T20:26:00.6040528Z 2025-05-07T20:26:00.6040569Z 2025-05-07T20:26:00.6040574Z 2025-05-07T20:26:00.6040577Z 2025-05-07T20:26:00.6044702Z 2025-05-07T20:26:00.6337269Z cuda-nvvp-12.6.80 | 109.3 MB | #5 | 15%  2025-05-07T20:26:00.6337552Z 2025-05-07T20:26:00.6662000Z libcublas-12.6.4.1 | 256.2 MB | ########2 | 82%  2025-05-07T20:26:00.6662366Z 2025-05-07T20:26:00.6662372Z 2025-05-07T20:26:00.6662377Z 2025-05-07T20:26:00.6662382Z 2025-05-07T20:26:00.6662387Z 2025-05-07T20:26:00.6664749Z 2025-05-07T20:26:00.7042124Z libcusolver-11.7.1.2 | 95.8 MB | 9 | 9%  2025-05-07T20:26:00.7083688Z nsight-compute-2024. | 443.1 MB | ####9 | 49% 2025-05-07T20:26:00.7083953Z 2025-05-07T20:26:00.7083957Z 2025-05-07T20:26:00.7083960Z 2025-05-07T20:26:00.7083964Z 2025-05-07T20:26:00.7086845Z 2025-05-07T20:26:00.7482071Z cuda-nvvp-12.6.80 | 109.3 MB | #7 | 18%  2025-05-07T20:26:00.7483126Z 2025-05-07T20:26:00.7665196Z libcublas-12.6.4.1 | 256.2 MB | ########3 | 84%  2025-05-07T20:26:00.7665642Z 2025-05-07T20:26:00.7665646Z 2025-05-07T20:26:00.7665650Z 2025-05-07T20:26:00.7665653Z 2025-05-07T20:26:00.7665657Z 2025-05-07T20:26:00.7667758Z 2025-05-07T20:26:00.8145701Z libcusolver-11.7.1.2 | 95.8 MB | #2 | 12%  2025-05-07T20:26:00.8274926Z nsight-compute-2024. | 443.1 MB | ####9 | 50% 2025-05-07T20:26:00.8275185Z 2025-05-07T20:26:00.8275197Z 2025-05-07T20:26:00.8275201Z 2025-05-07T20:26:00.8275205Z 2025-05-07T20:26:00.8279070Z 2025-05-07T20:26:00.8666536Z cuda-nvvp-12.6.80 | 109.3 MB | ## | 21%  2025-05-07T20:26:00.8666841Z 2025-05-07T20:26:00.8666845Z 2025-05-07T20:26:00.8666849Z 2025-05-07T20:26:00.8666853Z 2025-05-07T20:26:00.8666856Z 2025-05-07T20:26:00.8666860Z 2025-05-07T20:26:00.8673409Z libcusolver-11.7.1.2 | 95.8 MB | #5 | 15%  2025-05-07T20:26:00.8673864Z 2025-05-07T20:26:00.9317840Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 85%  2025-05-07T20:26:00.9371619Z nsight-compute-2024. | 443.1 MB | ##### | 51% 2025-05-07T20:26:00.9371892Z 2025-05-07T20:26:00.9371896Z 2025-05-07T20:26:00.9371900Z 2025-05-07T20:26:00.9371903Z 2025-05-07T20:26:00.9371939Z 2025-05-07T20:26:00.9673384Z cuda-nvvp-12.6.80 | 109.3 MB | ##3 | 23%  2025-05-07T20:26:00.9673679Z 2025-05-07T20:26:00.9673683Z 2025-05-07T20:26:00.9673686Z 2025-05-07T20:26:00.9673690Z 2025-05-07T20:26:00.9673694Z 2025-05-07T20:26:00.9674392Z 2025-05-07T20:26:00.9790537Z libcusolver-11.7.1.2 | 95.8 MB | #8 | 18%  2025-05-07T20:26:00.9790861Z 2025-05-07T20:26:01.0322233Z libcublas-12.6.4.1 | 256.2 MB | ########6 | 86%  2025-05-07T20:26:01.0397183Z nsight-compute-2024. 
| 443.1 MB | #####1 | 51% 2025-05-07T20:26:01.0397445Z 2025-05-07T20:26:01.0397501Z 2025-05-07T20:26:01.0397505Z 2025-05-07T20:26:01.0397658Z 2025-05-07T20:26:01.0398584Z 2025-05-07T20:26:01.0684261Z cuda-nvvp-12.6.80 | 109.3 MB | ##5 | 26%  2025-05-07T20:26:01.0684670Z 2025-05-07T20:26:01.0684677Z 2025-05-07T20:26:01.0684684Z 2025-05-07T20:26:01.0684690Z 2025-05-07T20:26:01.0684697Z 2025-05-07T20:26:01.0684812Z 2025-05-07T20:26:01.0958190Z libcusolver-11.7.1.2 | 95.8 MB | ##1 | 21%  2025-05-07T20:26:01.0958636Z 2025-05-07T20:26:01.1327032Z libcublas-12.6.4.1 | 256.2 MB | ########7 | 88%  2025-05-07T20:26:01.1398868Z nsight-compute-2024. | 443.1 MB | #####1 | 52% 2025-05-07T20:26:01.1399122Z 2025-05-07T20:26:01.1399250Z 2025-05-07T20:26:01.1399254Z 2025-05-07T20:26:01.1399271Z 2025-05-07T20:26:01.1399377Z 2025-05-07T20:26:01.1686675Z cuda-nvvp-12.6.80 | 109.3 MB | ##8 | 29%  2025-05-07T20:26:01.1687109Z 2025-05-07T20:26:01.1687115Z 2025-05-07T20:26:01.1687121Z 2025-05-07T20:26:01.1687126Z 2025-05-07T20:26:01.1687131Z 2025-05-07T20:26:01.1689657Z 2025-05-07T20:26:01.2013138Z libcusolver-11.7.1.2 | 95.8 MB | ##4 | 25%  2025-05-07T20:26:01.2013606Z 2025-05-07T20:26:01.2399875Z libcublas-12.6.4.1 | 256.2 MB | ########8 | 89%  2025-05-07T20:26:01.2409480Z nsight-compute-2024. | 443.1 MB | #####2 | 53% 2025-05-07T20:26:01.2409732Z 2025-05-07T20:26:01.2409771Z 2025-05-07T20:26:01.2409775Z 2025-05-07T20:26:01.2409778Z 2025-05-07T20:26:01.2410219Z 2025-05-07T20:26:01.2689688Z cuda-nvvp-12.6.80 | 109.3 MB | ###1 | 32%  2025-05-07T20:26:01.2689980Z 2025-05-07T20:26:01.2690126Z 2025-05-07T20:26:01.2690137Z 2025-05-07T20:26:01.2690140Z 2025-05-07T20:26:01.2690144Z 2025-05-07T20:26:01.2692927Z 2025-05-07T20:26:01.3048617Z libcusolver-11.7.1.2 | 95.8 MB | ##7 | 28%  2025-05-07T20:26:01.3050923Z 2025-05-07T20:26:01.3404032Z libcublas-12.6.4.1 | 256.2 MB | ######### | 90%  2025-05-07T20:26:01.3420691Z nsight-compute-2024. | 443.1 MB | #####3 | 53% 2025-05-07T20:26:01.3421027Z 2025-05-07T20:26:01.3421383Z 2025-05-07T20:26:01.3421570Z 2025-05-07T20:26:01.3421577Z 2025-05-07T20:26:01.3425504Z 2025-05-07T20:26:01.3692515Z cuda-nvvp-12.6.80 | 109.3 MB | ###4 | 34%  2025-05-07T20:26:01.3692797Z 2025-05-07T20:26:01.3693239Z 2025-05-07T20:26:01.3693245Z 2025-05-07T20:26:01.3693249Z 2025-05-07T20:26:01.3693252Z 2025-05-07T20:26:01.3694148Z 2025-05-07T20:26:01.4052603Z libcusolver-11.7.1.2 | 95.8 MB | ###1 | 31%  2025-05-07T20:26:01.4054091Z 2025-05-07T20:26:01.4438346Z libcublas-12.6.4.1 | 256.2 MB | #########1 | 91%  2025-05-07T20:26:01.4438605Z 2025-05-07T20:26:01.4438609Z 2025-05-07T20:26:01.4438613Z 2025-05-07T20:26:01.4438617Z 2025-05-07T20:26:01.4441655Z 2025-05-07T20:26:01.4473921Z cuda-nvvp-12.6.80 | 109.3 MB | ###7 | 37%  2025-05-07T20:26:01.4693236Z nsight-compute-2024. | 443.1 MB | #####4 | 54% 2025-05-07T20:26:01.4693493Z 2025-05-07T20:26:01.4693497Z 2025-05-07T20:26:01.4693501Z 2025-05-07T20:26:01.4693542Z 2025-05-07T20:26:01.4693546Z 2025-05-07T20:26:01.4696616Z 2025-05-07T20:26:01.5053588Z libcusolver-11.7.1.2 | 95.8 MB | ###4 | 35%  2025-05-07T20:26:01.5054466Z 2025-05-07T20:26:01.5512038Z libcublas-12.6.4.1 | 256.2 MB | #########2 | 93%  2025-05-07T20:26:01.5512457Z nsight-compute-2024. 
| 443.1 MB | #####4 | 55% 2025-05-07T20:26:01.5512696Z 2025-05-07T20:26:01.5512700Z 2025-05-07T20:26:01.5512704Z 2025-05-07T20:26:01.5512708Z 2025-05-07T20:26:01.5512971Z 2025-05-07T20:26:01.6058446Z cuda-nvvp-12.6.80 | 109.3 MB | ###9 | 40%  2025-05-07T20:26:01.6059033Z 2025-05-07T20:26:01.6235173Z libcublas-12.6.4.1 | 256.2 MB | #########4 | 94%  2025-05-07T20:26:01.6235432Z 2025-05-07T20:26:01.6235436Z 2025-05-07T20:26:01.6235440Z 2025-05-07T20:26:01.6235444Z 2025-05-07T20:26:01.6235447Z 2025-05-07T20:26:01.6241483Z 2025-05-07T20:26:01.6513334Z libcusolver-11.7.1.2 | 95.8 MB | ###7 | 38%  2025-05-07T20:26:01.6513693Z 2025-05-07T20:26:01.6513698Z 2025-05-07T20:26:01.6513702Z 2025-05-07T20:26:01.6513705Z 2025-05-07T20:26:01.6513709Z 2025-05-07T20:26:01.7060380Z cuda-nvvp-12.6.80 | 109.3 MB | ####2 | 43%  2025-05-07T20:26:01.7062946Z 2025-05-07T20:26:01.7238224Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 95%  2025-05-07T20:26:01.7238490Z 2025-05-07T20:26:01.7238494Z 2025-05-07T20:26:01.7238497Z 2025-05-07T20:26:01.7238501Z 2025-05-07T20:26:01.7238505Z 2025-05-07T20:26:01.7238992Z 2025-05-07T20:26:01.7516710Z libcusolver-11.7.1.2 | 95.8 MB | ####1 | 42%  2025-05-07T20:26:01.7517025Z 2025-05-07T20:26:01.7517029Z 2025-05-07T20:26:01.7517033Z 2025-05-07T20:26:01.7517037Z 2025-05-07T20:26:01.7517043Z 2025-05-07T20:26:01.7599332Z cuda-nvvp-12.6.80 | 109.3 MB | ####5 | 46%  2025-05-07T20:26:01.8324182Z nsight-compute-2024. | 443.1 MB | #####5 | 55% 2025-05-07T20:26:01.8327183Z 2025-05-07T20:26:01.8414621Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 97%  2025-05-07T20:26:01.8415001Z 2025-05-07T20:26:01.8415007Z 2025-05-07T20:26:01.8415010Z 2025-05-07T20:26:01.8415023Z 2025-05-07T20:26:01.8415027Z 2025-05-07T20:26:01.8415031Z 2025-05-07T20:26:01.8519021Z libcusolver-11.7.1.2 | 95.8 MB | ####5 | 45%  2025-05-07T20:26:01.8519325Z 2025-05-07T20:26:01.8519329Z 2025-05-07T20:26:01.8519340Z 2025-05-07T20:26:01.8519343Z 2025-05-07T20:26:01.8519347Z 2025-05-07T20:26:01.8608975Z cuda-nvvp-12.6.80 | 109.3 MB | ####8 | 49%  2025-05-07T20:26:01.9325440Z nsight-compute-2024. | 443.1 MB | #####6 | 56% 2025-05-07T20:26:01.9328533Z 2025-05-07T20:26:01.9419283Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 98%  2025-05-07T20:26:01.9429300Z 2025-05-07T20:26:01.9429305Z 2025-05-07T20:26:01.9429309Z 2025-05-07T20:26:01.9429313Z 2025-05-07T20:26:01.9429317Z 2025-05-07T20:26:01.9429396Z 2025-05-07T20:26:01.9604345Z libcusolver-11.7.1.2 | 95.8 MB | ####8 | 48%  2025-05-07T20:26:01.9604823Z 2025-05-07T20:26:01.9604827Z 2025-05-07T20:26:01.9604831Z 2025-05-07T20:26:01.9604834Z 2025-05-07T20:26:01.9604838Z 2025-05-07T20:26:01.9611839Z cuda-nvvp-12.6.80 | 109.3 MB | #####1 | 52%  2025-05-07T20:26:02.0417992Z nsight-compute-2024. | 443.1 MB | #####6 | 57% 2025-05-07T20:26:02.0418722Z 2025-05-07T20:26:02.0429003Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 99%  2025-05-07T20:26:02.0429261Z 2025-05-07T20:26:02.0429265Z 2025-05-07T20:26:02.0429270Z 2025-05-07T20:26:02.0429287Z 2025-05-07T20:26:02.0429292Z 2025-05-07T20:26:02.0429324Z 2025-05-07T20:26:02.0613597Z libcusolver-11.7.1.2 | 95.8 MB | #####1 | 51%  2025-05-07T20:26:02.0671033Z nsight-compute-2024. 
| 443.1 MB | #####7 | 57% 2025-05-07T20:26:02.0671478Z 2025-05-07T20:26:02.0671482Z 2025-05-07T20:26:02.0671486Z 2025-05-07T20:26:02.0671489Z 2025-05-07T20:26:02.0672499Z 2025-05-07T20:26:02.1431019Z cuda-nvvp-12.6.80 | 109.3 MB | #####4 | 55%  2025-05-07T20:26:02.1431317Z 2025-05-07T20:26:02.1431321Z 2025-05-07T20:26:02.1431325Z 2025-05-07T20:26:02.1431329Z 2025-05-07T20:26:02.1431332Z 2025-05-07T20:26:02.1433766Z 2025-05-07T20:26:02.1652203Z libcusolver-11.7.1.2 | 95.8 MB | #####4 | 54%  2025-05-07T20:26:02.1674811Z nsight-compute-2024. | 443.1 MB | #####8 | 58% 2025-05-07T20:26:02.1675060Z 2025-05-07T20:26:02.1675064Z 2025-05-07T20:26:02.1675068Z 2025-05-07T20:26:02.1675071Z 2025-05-07T20:26:02.1677906Z 2025-05-07T20:26:02.2432485Z cuda-nvvp-12.6.80 | 109.3 MB | #####7 | 58%  2025-05-07T20:26:02.2432941Z 2025-05-07T20:26:02.2432948Z 2025-05-07T20:26:02.2432954Z 2025-05-07T20:26:02.2432959Z 2025-05-07T20:26:02.2432964Z 2025-05-07T20:26:02.2432970Z 2025-05-07T20:26:02.2653622Z libcusolver-11.7.1.2 | 95.8 MB | #####7 | 58%  2025-05-07T20:26:02.2676568Z nsight-compute-2024. | 443.1 MB | #####8 | 59% 2025-05-07T20:26:02.2676855Z 2025-05-07T20:26:02.2676859Z 2025-05-07T20:26:02.2676863Z 2025-05-07T20:26:02.2676867Z 2025-05-07T20:26:02.2681138Z 2025-05-07T20:26:02.3118255Z cuda-nvvp-12.6.80 | 109.3 MB | ###### | 61%  2025-05-07T20:26:02.3118550Z 2025-05-07T20:26:02.3123613Z 2025-05-07T20:26:02.3439322Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:26:02.3439609Z 2025-05-07T20:26:02.3439620Z 2025-05-07T20:26:02.3439625Z 2025-05-07T20:26:02.3439630Z 2025-05-07T20:26:02.3439634Z 2025-05-07T20:26:02.3442108Z 2025-05-07T20:26:02.3664088Z libcusolver-11.7.1.2 | 95.8 MB | ######1 | 61%  2025-05-07T20:26:02.3829610Z nsight-compute-2024. | 443.1 MB | #####9 | 59% 2025-05-07T20:26:02.3829939Z 2025-05-07T20:26:02.3829943Z 2025-05-07T20:26:02.3829947Z 2025-05-07T20:26:02.3829950Z 2025-05-07T20:26:02.3829954Z 2025-05-07T20:26:02.3906409Z cuda-nvvp-12.6.80 | 109.3 MB | ######3 | 63%  2025-05-07T20:26:02.3906849Z 2025-05-07T20:26:02.3906855Z 2025-05-07T20:26:02.3906860Z 2025-05-07T20:26:02.3906865Z 2025-05-07T20:26:02.3906871Z 2025-05-07T20:26:02.3906874Z 2025-05-07T20:26:02.3906878Z 2025-05-07T20:26:02.4518876Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:26:02.4519187Z 2025-05-07T20:26:02.4519191Z 2025-05-07T20:26:02.4519194Z 2025-05-07T20:26:02.4519198Z 2025-05-07T20:26:02.4519202Z 2025-05-07T20:26:02.4519205Z 2025-05-07T20:26:02.4864358Z libcusolver-11.7.1.2 | 95.8 MB | ######4 | 64%  2025-05-07T20:26:02.4907644Z nsight-compute-2024. 
| 443.1 MB | ###### | 60% 2025-05-07T20:26:02.4908497Z 2025-05-07T20:26:02.4908805Z 2025-05-07T20:26:02.4908835Z 2025-05-07T20:26:02.4908838Z 2025-05-07T20:26:02.4908842Z 2025-05-07T20:26:02.4908961Z 2025-05-07T20:26:02.4908965Z 2025-05-07T20:26:02.5028370Z libnpp-12.3.1.54 | 93.4 MB | 3 | 3%  2025-05-07T20:26:02.5029364Z 2025-05-07T20:26:02.5029372Z 2025-05-07T20:26:02.5029378Z 2025-05-07T20:26:02.5029383Z 2025-05-07T20:26:02.5029389Z 2025-05-07T20:26:02.5651261Z cuda-nvvp-12.6.80 | 109.3 MB | ######6 | 66%  2025-05-07T20:26:02.5651571Z 2025-05-07T20:26:02.5651578Z 2025-05-07T20:26:02.5651583Z 2025-05-07T20:26:02.5651589Z 2025-05-07T20:26:02.5651592Z 2025-05-07T20:26:02.5651596Z 2025-05-07T20:26:02.5909439Z libcusolver-11.7.1.2 | 95.8 MB | ######7 | 67%  2025-05-07T20:26:02.5909861Z 2025-05-07T20:26:02.5909865Z 2025-05-07T20:26:02.5909869Z 2025-05-07T20:26:02.5909872Z 2025-05-07T20:26:02.5909877Z 2025-05-07T20:26:02.5909881Z 2025-05-07T20:26:02.5909886Z 2025-05-07T20:26:02.5927424Z libnpp-12.3.1.54 | 93.4 MB | 5 | 6%  2025-05-07T20:26:02.6139289Z nsight-compute-2024. | 443.1 MB | ###### | 61% 2025-05-07T20:26:02.6139648Z 2025-05-07T20:26:02.6139654Z 2025-05-07T20:26:02.6139659Z 2025-05-07T20:26:02.6139694Z 2025-05-07T20:26:02.6141725Z 2025-05-07T20:26:02.6827131Z cuda-nvvp-12.6.80 | 109.3 MB | ######8 | 69%  2025-05-07T20:26:02.6827512Z 2025-05-07T20:26:02.6827517Z 2025-05-07T20:26:02.6827523Z 2025-05-07T20:26:02.6827528Z 2025-05-07T20:26:02.6827533Z 2025-05-07T20:26:02.6827539Z 2025-05-07T20:26:02.6913251Z libcusolver-11.7.1.2 | 95.8 MB | ####### | 70%  2025-05-07T20:26:02.6913620Z 2025-05-07T20:26:02.6913624Z 2025-05-07T20:26:02.6913812Z 2025-05-07T20:26:02.6913819Z 2025-05-07T20:26:02.6913825Z 2025-05-07T20:26:02.6913830Z 2025-05-07T20:26:02.6920876Z 2025-05-07T20:26:02.7039240Z libnpp-12.3.1.54 | 93.4 MB | 8 | 9%  2025-05-07T20:26:02.7202012Z nsight-compute-2024. | 443.1 MB | ######1 | 61% 2025-05-07T20:26:02.7202368Z 2025-05-07T20:26:02.7202372Z 2025-05-07T20:26:02.7202376Z 2025-05-07T20:26:02.7202380Z 2025-05-07T20:26:02.7204295Z 2025-05-07T20:26:02.7836171Z cuda-nvvp-12.6.80 | 109.3 MB | #######1 | 71%  2025-05-07T20:26:02.7836483Z 2025-05-07T20:26:02.7836487Z 2025-05-07T20:26:02.7836490Z 2025-05-07T20:26:02.7836494Z 2025-05-07T20:26:02.7836506Z 2025-05-07T20:26:02.7842621Z 2025-05-07T20:26:02.7919540Z libcusolver-11.7.1.2 | 95.8 MB | #######3 | 73%  2025-05-07T20:26:02.7919950Z 2025-05-07T20:26:02.7919953Z 2025-05-07T20:26:02.7919966Z 2025-05-07T20:26:02.7919970Z 2025-05-07T20:26:02.7919973Z 2025-05-07T20:26:02.7919977Z 2025-05-07T20:26:02.7919981Z 2025-05-07T20:26:02.8174471Z libnpp-12.3.1.54 | 93.4 MB | #1 | 12%  2025-05-07T20:26:02.8302144Z nsight-compute-2024. | 443.1 MB | ######1 | 62% 2025-05-07T20:26:02.8302451Z 2025-05-07T20:26:02.8302457Z 2025-05-07T20:26:02.8302462Z 2025-05-07T20:26:02.8302467Z 2025-05-07T20:26:02.8306032Z 2025-05-07T20:26:02.8923766Z cuda-nvvp-12.6.80 | 109.3 MB | #######4 | 74%  2025-05-07T20:26:02.8924111Z 2025-05-07T20:26:02.8924144Z 2025-05-07T20:26:02.8924164Z 2025-05-07T20:26:02.8924168Z 2025-05-07T20:26:02.8924171Z 2025-05-07T20:26:02.8924175Z 2025-05-07T20:26:02.8931250Z 2025-05-07T20:26:02.8953692Z libnpp-12.3.1.54 | 93.4 MB | #4 | 14%  2025-05-07T20:26:02.8953993Z 2025-05-07T20:26:02.8953996Z 2025-05-07T20:26:02.8954000Z 2025-05-07T20:26:02.8954004Z 2025-05-07T20:26:02.8954007Z 2025-05-07T20:26:02.8954011Z 2025-05-07T20:26:02.9186565Z libcusolver-11.7.1.2 | 95.8 MB | #######6 | 76%  2025-05-07T20:26:02.9476915Z nsight-compute-2024. 
| 443.1 MB | ######2 | 62% 2025-05-07T20:26:02.9477186Z 2025-05-07T20:26:02.9477190Z 2025-05-07T20:26:02.9477194Z 2025-05-07T20:26:02.9479210Z 2025-05-07T20:26:02.9479216Z 2025-05-07T20:26:02.9938586Z cuda-nvvp-12.6.80 | 109.3 MB | #######6 | 76%  2025-05-07T20:26:02.9939017Z 2025-05-07T20:26:02.9939023Z 2025-05-07T20:26:02.9939028Z 2025-05-07T20:26:02.9939033Z 2025-05-07T20:26:02.9939038Z 2025-05-07T20:26:02.9939513Z 2025-05-07T20:26:02.9939519Z 2025-05-07T20:26:02.9958314Z libnpp-12.3.1.54 | 93.4 MB | #7 | 17%  2025-05-07T20:26:02.9958598Z 2025-05-07T20:26:02.9958602Z 2025-05-07T20:26:02.9958606Z 2025-05-07T20:26:02.9958610Z 2025-05-07T20:26:02.9958613Z 2025-05-07T20:26:02.9958617Z 2025-05-07T20:26:03.0190558Z libcusolver-11.7.1.2 | 95.8 MB | #######9 | 79%  2025-05-07T20:26:03.0593725Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:26:03.0593987Z 2025-05-07T20:26:03.0593991Z 2025-05-07T20:26:03.0594005Z 2025-05-07T20:26:03.0594009Z 2025-05-07T20:26:03.0599548Z 2025-05-07T20:26:03.0944468Z cuda-nvvp-12.6.80 | 109.3 MB | #######8 | 79%  2025-05-07T20:26:03.0944824Z 2025-05-07T20:26:03.0944831Z 2025-05-07T20:26:03.0944836Z 2025-05-07T20:26:03.0944841Z 2025-05-07T20:26:03.0944846Z 2025-05-07T20:26:03.0944851Z 2025-05-07T20:26:03.0946739Z 2025-05-07T20:26:03.0999440Z libnpp-12.3.1.54 | 93.4 MB | ## | 20%  2025-05-07T20:26:03.0999764Z 2025-05-07T20:26:03.0999769Z 2025-05-07T20:26:03.0999773Z 2025-05-07T20:26:03.0999779Z 2025-05-07T20:26:03.0999788Z 2025-05-07T20:26:03.0999810Z 2025-05-07T20:26:03.1192905Z libcusolver-11.7.1.2 | 95.8 MB | ########2 | 82%  2025-05-07T20:26:03.1637778Z nsight-compute-2024. | 443.1 MB | ######3 | 64% 2025-05-07T20:26:03.1638055Z 2025-05-07T20:26:03.1638059Z 2025-05-07T20:26:03.1638063Z 2025-05-07T20:26:03.1638067Z 2025-05-07T20:26:03.1641593Z 2025-05-07T20:26:03.1960053Z cuda-nvvp-12.6.80 | 109.3 MB | ########1 | 81%  2025-05-07T20:26:03.1960344Z 2025-05-07T20:26:03.1960349Z 2025-05-07T20:26:03.1960352Z 2025-05-07T20:26:03.1960356Z 2025-05-07T20:26:03.1960360Z 2025-05-07T20:26:03.1960364Z 2025-05-07T20:26:03.1962649Z 2025-05-07T20:26:03.2122308Z libnpp-12.3.1.54 | 93.4 MB | ##2 | 23%  2025-05-07T20:26:03.2122594Z 2025-05-07T20:26:03.2122631Z 2025-05-07T20:26:03.2122635Z 2025-05-07T20:26:03.2122639Z 2025-05-07T20:26:03.2122642Z 2025-05-07T20:26:03.2122646Z 2025-05-07T20:26:03.2204525Z libcusolver-11.7.1.2 | 95.8 MB | ########5 | 85%  2025-05-07T20:26:03.2637920Z nsight-compute-2024. | 443.1 MB | ######4 | 64% 2025-05-07T20:26:03.2638189Z 2025-05-07T20:26:03.2638193Z 2025-05-07T20:26:03.2638197Z 2025-05-07T20:26:03.2638201Z 2025-05-07T20:26:03.2640442Z 2025-05-07T20:26:03.3007072Z cuda-nvvp-12.6.80 | 109.3 MB | ########3 | 83%  2025-05-07T20:26:03.3007366Z 2025-05-07T20:26:03.3007370Z 2025-05-07T20:26:03.3007374Z 2025-05-07T20:26:03.3007377Z 2025-05-07T20:26:03.3007381Z 2025-05-07T20:26:03.3007385Z 2025-05-07T20:26:03.3008135Z 2025-05-07T20:26:03.3168618Z libnpp-12.3.1.54 | 93.4 MB | ##5 | 26%  2025-05-07T20:26:03.3168896Z 2025-05-07T20:26:03.3168900Z 2025-05-07T20:26:03.3168904Z 2025-05-07T20:26:03.3175054Z 2025-05-07T20:26:03.3183129Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:26:03.3183402Z 2025-05-07T20:26:03.3183407Z 2025-05-07T20:26:03.3183410Z 2025-05-07T20:26:03.3183414Z 2025-05-07T20:26:03.3183418Z 2025-05-07T20:26:03.3183421Z 2025-05-07T20:26:03.3217825Z libcusolver-11.7.1.2 | 95.8 MB | ########7 | 88%  2025-05-07T20:26:03.3809416Z nsight-compute-2024. 
| 443.1 MB | ######4 | 65% 2025-05-07T20:26:03.3809706Z 2025-05-07T20:26:03.3809710Z 2025-05-07T20:26:03.3809714Z 2025-05-07T20:26:03.3809720Z 2025-05-07T20:26:03.3814085Z 2025-05-07T20:26:03.4009664Z cuda-nvvp-12.6.80 | 109.3 MB | ########5 | 86%  2025-05-07T20:26:03.4009961Z 2025-05-07T20:26:03.4009965Z 2025-05-07T20:26:03.4009969Z 2025-05-07T20:26:03.4009972Z 2025-05-07T20:26:03.4009976Z 2025-05-07T20:26:03.4009979Z 2025-05-07T20:26:03.4011008Z 2025-05-07T20:26:03.4185521Z libnpp-12.3.1.54 | 93.4 MB | ##8 | 29%  2025-05-07T20:26:03.4186265Z 2025-05-07T20:26:03.4186496Z 2025-05-07T20:26:03.4186505Z 2025-05-07T20:26:03.4186512Z 2025-05-07T20:26:03.4186519Z 2025-05-07T20:26:03.4186574Z 2025-05-07T20:26:03.4224824Z libcusolver-11.7.1.2 | 95.8 MB | ######### | 91%  2025-05-07T20:26:03.4812413Z nsight-compute-2024. | 443.1 MB | ######5 | 65% 2025-05-07T20:26:03.4812687Z 2025-05-07T20:26:03.4812698Z 2025-05-07T20:26:03.4812702Z 2025-05-07T20:26:03.4812706Z 2025-05-07T20:26:03.4812710Z 2025-05-07T20:26:03.5011029Z cuda-nvvp-12.6.80 | 109.3 MB | ########8 | 88%  2025-05-07T20:26:03.5011314Z 2025-05-07T20:26:03.5011324Z 2025-05-07T20:26:03.5011327Z 2025-05-07T20:26:03.5011331Z 2025-05-07T20:26:03.5011334Z 2025-05-07T20:26:03.5011339Z 2025-05-07T20:26:03.5011342Z 2025-05-07T20:26:03.5187284Z libnpp-12.3.1.54 | 93.4 MB | ###1 | 31%  2025-05-07T20:26:03.5187575Z 2025-05-07T20:26:03.5187579Z 2025-05-07T20:26:03.5187583Z 2025-05-07T20:26:03.5187611Z 2025-05-07T20:26:03.5187625Z 2025-05-07T20:26:03.5187629Z 2025-05-07T20:26:03.5229226Z libcusolver-11.7.1.2 | 95.8 MB | #########3 | 94%  2025-05-07T20:26:03.5816007Z nsight-compute-2024. | 443.1 MB | ######6 | 66% 2025-05-07T20:26:03.5816377Z 2025-05-07T20:26:03.5816382Z 2025-05-07T20:26:03.5816387Z 2025-05-07T20:26:03.5816393Z 2025-05-07T20:26:03.5816409Z 2025-05-07T20:26:03.6011598Z cuda-nvvp-12.6.80 | 109.3 MB | ######### | 91%  2025-05-07T20:26:03.6011979Z 2025-05-07T20:26:03.6011984Z 2025-05-07T20:26:03.6011989Z 2025-05-07T20:26:03.6012005Z 2025-05-07T20:26:03.6012012Z 2025-05-07T20:26:03.6012017Z 2025-05-07T20:26:03.6012022Z 2025-05-07T20:26:03.6189862Z libnpp-12.3.1.54 | 93.4 MB | ###4 | 34%  2025-05-07T20:26:03.6190238Z 2025-05-07T20:26:03.6190773Z 2025-05-07T20:26:03.6190779Z 2025-05-07T20:26:03.6190784Z 2025-05-07T20:26:03.6190789Z 2025-05-07T20:26:03.6190962Z 2025-05-07T20:26:03.6230302Z libcusolver-11.7.1.2 | 95.8 MB | #########7 | 97%  2025-05-07T20:26:03.6820078Z nsight-compute-2024. | 443.1 MB | ######6 | 67% 2025-05-07T20:26:03.6820361Z 2025-05-07T20:26:03.6820366Z 2025-05-07T20:26:03.6820372Z 2025-05-07T20:26:03.6820377Z 2025-05-07T20:26:03.6824588Z 2025-05-07T20:26:03.7018095Z cuda-nvvp-12.6.80 | 109.3 MB | #########3 | 94%  2025-05-07T20:26:03.7018389Z 2025-05-07T20:26:03.7018395Z 2025-05-07T20:26:03.7018400Z 2025-05-07T20:26:03.7018413Z 2025-05-07T20:26:03.7018417Z 2025-05-07T20:26:03.7018423Z 2025-05-07T20:26:03.7018428Z 2025-05-07T20:26:03.7232588Z libnpp-12.3.1.54 | 93.4 MB | ###7 | 38%  2025-05-07T20:26:03.7823151Z nsight-compute-2024. 
| 443.1 MB | ######7 | 67% 2025-05-07T20:26:03.7823517Z 2025-05-07T20:26:03.7823524Z 2025-05-07T20:26:03.7823530Z 2025-05-07T20:26:03.7823536Z 2025-05-07T20:26:03.7826628Z 2025-05-07T20:26:03.8019937Z cuda-nvvp-12.6.80 | 109.3 MB | #########6 | 97%  2025-05-07T20:26:03.8020306Z 2025-05-07T20:26:03.8020312Z 2025-05-07T20:26:03.8020317Z 2025-05-07T20:26:03.8020322Z 2025-05-07T20:26:03.8020327Z 2025-05-07T20:26:03.8020333Z 2025-05-07T20:26:03.8020338Z 2025-05-07T20:26:03.8237343Z libnpp-12.3.1.54 | 93.4 MB | ####1 | 41%  2025-05-07T20:26:03.8824533Z nsight-compute-2024. | 443.1 MB | ######8 | 68% 2025-05-07T20:26:03.8824913Z 2025-05-07T20:26:03.8824917Z 2025-05-07T20:26:03.8824920Z 2025-05-07T20:26:03.8824924Z 2025-05-07T20:26:03.8824928Z 2025-05-07T20:26:03.9023647Z cuda-nvvp-12.6.80 | 109.3 MB | #########9 | 100%  2025-05-07T20:26:03.9024037Z 2025-05-07T20:26:03.9024043Z 2025-05-07T20:26:03.9024048Z 2025-05-07T20:26:03.9024053Z 2025-05-07T20:26:03.9024058Z 2025-05-07T20:26:03.9024063Z 2025-05-07T20:26:03.9025410Z 2025-05-07T20:26:03.9242385Z libnpp-12.3.1.54 | 93.4 MB | ####4 | 45%  2025-05-07T20:26:04.0024376Z nsight-compute-2024. | 443.1 MB | ######8 | 69% 2025-05-07T20:26:04.0024828Z 2025-05-07T20:26:04.0024832Z 2025-05-07T20:26:04.0024836Z 2025-05-07T20:26:04.0024840Z 2025-05-07T20:26:04.0024843Z 2025-05-07T20:26:04.0024847Z 2025-05-07T20:26:04.0026290Z 2025-05-07T20:26:04.0249211Z libnpp-12.3.1.54 | 93.4 MB | ####8 | 49%  2025-05-07T20:26:04.1024125Z nsight-compute-2024. | 443.1 MB | ######9 | 70% 2025-05-07T20:26:04.1024411Z 2025-05-07T20:26:04.1024416Z 2025-05-07T20:26:04.1024419Z 2025-05-07T20:26:04.1024423Z 2025-05-07T20:26:04.1024427Z 2025-05-07T20:26:04.1024432Z 2025-05-07T20:26:04.1024879Z 2025-05-07T20:26:04.1253247Z libnpp-12.3.1.54 | 93.4 MB | #####2 | 52%  2025-05-07T20:26:04.2027233Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:26:04.2027544Z 2025-05-07T20:26:04.2027548Z 2025-05-07T20:26:04.2027552Z 2025-05-07T20:26:04.2027556Z 2025-05-07T20:26:04.2027560Z 2025-05-07T20:26:04.2027564Z 2025-05-07T20:26:04.2027606Z 2025-05-07T20:26:04.2255628Z libnpp-12.3.1.54 | 93.4 MB | #####6 | 56%  2025-05-07T20:26:04.3042748Z nsight-compute-2024. | 443.1 MB | #######1 | 71% 2025-05-07T20:26:04.3043038Z 2025-05-07T20:26:04.3043042Z 2025-05-07T20:26:04.3043045Z 2025-05-07T20:26:04.3043049Z 2025-05-07T20:26:04.3043053Z 2025-05-07T20:26:04.3043056Z 2025-05-07T20:26:04.3043060Z 2025-05-07T20:26:04.3259681Z libnpp-12.3.1.54 | 93.4 MB | ###### | 60%  2025-05-07T20:26:04.4052323Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:26:04.4052690Z 2025-05-07T20:26:04.4052696Z 2025-05-07T20:26:04.4052701Z 2025-05-07T20:26:04.4052715Z 2025-05-07T20:26:04.4052721Z 2025-05-07T20:26:04.4052726Z 2025-05-07T20:26:04.4052731Z 2025-05-07T20:26:04.4380203Z libnpp-12.3.1.54 | 93.4 MB | ######3 | 64%  2025-05-07T20:26:04.5053025Z nsight-compute-2024. | 443.1 MB | #######2 | 73% 2025-05-07T20:26:04.5053332Z 2025-05-07T20:26:04.5053352Z 2025-05-07T20:26:04.5053358Z 2025-05-07T20:26:04.5053363Z 2025-05-07T20:26:04.5053368Z 2025-05-07T20:26:04.5053373Z 2025-05-07T20:26:04.5057592Z 2025-05-07T20:26:04.5382776Z libnpp-12.3.1.54 | 93.4 MB | ######7 | 68%  2025-05-07T20:26:04.6123263Z nsight-compute-2024. 
| 443.1 MB | #######3 | 74% 2025-05-07T20:26:04.6123543Z 2025-05-07T20:26:04.6123548Z 2025-05-07T20:26:04.6123552Z 2025-05-07T20:26:04.6123555Z 2025-05-07T20:26:04.6123559Z 2025-05-07T20:26:04.6123563Z 2025-05-07T20:26:04.6124001Z 2025-05-07T20:26:04.6383257Z libnpp-12.3.1.54 | 93.4 MB | #######1 | 72%  2025-05-07T20:26:04.7127188Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:26:04.7127543Z 2025-05-07T20:26:04.7127548Z 2025-05-07T20:26:04.7127552Z 2025-05-07T20:26:04.7127555Z 2025-05-07T20:26:04.7127560Z 2025-05-07T20:26:04.7127564Z 2025-05-07T20:26:04.7129600Z 2025-05-07T20:26:04.7388476Z libnpp-12.3.1.54 | 93.4 MB | #######5 | 76%  2025-05-07T20:26:04.8167833Z nsight-compute-2024. | 443.1 MB | #######5 | 75% 2025-05-07T20:26:04.8168151Z 2025-05-07T20:26:04.8168157Z 2025-05-07T20:26:04.8168162Z 2025-05-07T20:26:04.8168167Z 2025-05-07T20:26:04.8168173Z 2025-05-07T20:26:04.8168178Z 2025-05-07T20:26:04.8170185Z 2025-05-07T20:26:04.8410531Z libnpp-12.3.1.54 | 93.4 MB | #######9 | 79%  2025-05-07T20:26:04.9170162Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:26:04.9170432Z 2025-05-07T20:26:04.9170436Z 2025-05-07T20:26:04.9170439Z 2025-05-07T20:26:04.9170443Z 2025-05-07T20:26:04.9170446Z 2025-05-07T20:26:04.9170452Z 2025-05-07T20:26:04.9171728Z 2025-05-07T20:26:04.9489755Z libnpp-12.3.1.54 | 93.4 MB | ########3 | 83%  2025-05-07T20:26:05.0172495Z nsight-compute-2024. | 443.1 MB | #######7 | 77% 2025-05-07T20:26:05.0172784Z 2025-05-07T20:26:05.0172790Z 2025-05-07T20:26:05.0173090Z 2025-05-07T20:26:05.0173262Z 2025-05-07T20:26:05.0173268Z 2025-05-07T20:26:05.0173273Z 2025-05-07T20:26:05.0173308Z 2025-05-07T20:26:05.0490717Z libnpp-12.3.1.54 | 93.4 MB | ########7 | 87%  2025-05-07T20:26:05.1173728Z nsight-compute-2024. | 443.1 MB | #######7 | 78% 2025-05-07T20:26:05.1174110Z 2025-05-07T20:26:05.1174116Z 2025-05-07T20:26:05.1174122Z 2025-05-07T20:26:05.1174127Z 2025-05-07T20:26:05.1174132Z 2025-05-07T20:26:05.1174137Z 2025-05-07T20:26:05.1176878Z 2025-05-07T20:26:05.1491889Z libnpp-12.3.1.54 | 93.4 MB | #########1 | 91%  2025-05-07T20:26:05.2179141Z nsight-compute-2024. | 443.1 MB | #######8 | 79% 2025-05-07T20:26:05.2179487Z 2025-05-07T20:26:05.2179494Z 2025-05-07T20:26:05.2179499Z 2025-05-07T20:26:05.2179506Z 2025-05-07T20:26:05.2179512Z 2025-05-07T20:26:05.2179518Z 2025-05-07T20:26:05.2179523Z 2025-05-07T20:26:05.2497685Z libnpp-12.3.1.54 | 93.4 MB | #########5 | 95%  2025-05-07T20:26:05.3181550Z nsight-compute-2024. | 443.1 MB | #######9 | 80% 2025-05-07T20:26:05.3181905Z 2025-05-07T20:26:05.3181912Z 2025-05-07T20:26:05.3181918Z 2025-05-07T20:26:05.3181923Z 2025-05-07T20:26:05.3181928Z 2025-05-07T20:26:05.3181933Z 2025-05-07T20:26:05.3181938Z 2025-05-07T20:26:05.3586616Z libnpp-12.3.1.54 | 93.4 MB | #########8 | 99%  2025-05-07T20:26:05.4586496Z nsight-compute-2024. | 443.1 MB | ######## | 80% 2025-05-07T20:26:05.5591359Z nsight-compute-2024. | 443.1 MB | ########1 | 81% 2025-05-07T20:26:05.6595668Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:26:05.7604154Z nsight-compute-2024. | 443.1 MB | ########3 | 83% 2025-05-07T20:26:05.8605511Z nsight-compute-2024. | 443.1 MB | ########4 | 84% 2025-05-07T20:26:05.9619093Z nsight-compute-2024. | 443.1 MB | ########5 | 85% 2025-05-07T20:26:06.0653240Z nsight-compute-2024. | 443.1 MB | ########6 | 86% 2025-05-07T20:26:06.1653595Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:26:06.2655200Z nsight-compute-2024. 
| 443.1 MB | ########7 | 88% 2025-05-07T20:26:06.4358831Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:26:06.4490516Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:26:06.4490871Z 2025-05-07T20:26:06.4490877Z 2025-05-07T20:26:06.4490882Z 2025-05-07T20:26:06.4490888Z 2025-05-07T20:26:06.4490893Z 2025-05-07T20:26:06.4495419Z 2025-05-07T20:26:06.5074203Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:26:06.5074507Z 2025-05-07T20:26:06.5074511Z 2025-05-07T20:26:06.5074515Z 2025-05-07T20:26:06.5074520Z 2025-05-07T20:26:06.5074525Z 2025-05-07T20:26:06.5074529Z 2025-05-07T20:26:06.5074545Z 2025-05-07T20:26:06.5080796Z 2025-05-07T20:26:06.5441975Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:26:06.6077453Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:26:06.6077742Z 2025-05-07T20:26:06.6077788Z 2025-05-07T20:26:06.6077792Z 2025-05-07T20:26:06.6077796Z 2025-05-07T20:26:06.6077799Z 2025-05-07T20:26:06.6077803Z 2025-05-07T20:26:06.6077807Z 2025-05-07T20:26:06.6080467Z 2025-05-07T20:26:06.6446470Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 7%  2025-05-07T20:26:06.7282406Z nsight-compute-2024. | 443.1 MB | #########1 | 92% 2025-05-07T20:26:06.7282700Z 2025-05-07T20:26:06.7282704Z 2025-05-07T20:26:06.7282708Z 2025-05-07T20:26:06.7282711Z 2025-05-07T20:26:06.7282715Z 2025-05-07T20:26:06.7282720Z 2025-05-07T20:26:06.7282723Z 2025-05-07T20:26:06.7282727Z 2025-05-07T20:26:06.7631976Z cuda-nvdisasm-12.6.7 | 47.6 MB | #3 | 13%  2025-05-07T20:26:06.8290655Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:26:06.8290956Z 2025-05-07T20:26:06.8290961Z 2025-05-07T20:26:06.8290979Z 2025-05-07T20:26:06.8290984Z 2025-05-07T20:26:06.8290990Z 2025-05-07T20:26:06.8290995Z 2025-05-07T20:26:06.8291477Z 2025-05-07T20:26:06.8294862Z 2025-05-07T20:26:06.8760734Z cuda-nvdisasm-12.6.7 | 47.6 MB | #9 | 20%  2025-05-07T20:26:06.9011154Z nsight-compute-2024. | 443.1 MB | #########3 | 93% 2025-05-07T20:26:06.9011449Z 2025-05-07T20:26:06.9011455Z 2025-05-07T20:26:06.9011460Z 2025-05-07T20:26:06.9011465Z 2025-05-07T20:26:06.9013406Z 2025-05-07T20:26:06.9290916Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:26:06.9291274Z 2025-05-07T20:26:06.9291278Z 2025-05-07T20:26:06.9291281Z 2025-05-07T20:26:06.9291285Z 2025-05-07T20:26:06.9291289Z 2025-05-07T20:26:06.9291294Z 2025-05-07T20:26:06.9291298Z 2025-05-07T20:26:06.9291645Z 2025-05-07T20:26:06.9460604Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##6 | 26%  2025-05-07T20:26:06.9460995Z 2025-05-07T20:26:06.9461000Z 2025-05-07T20:26:06.9461004Z 2025-05-07T20:26:06.9461017Z 2025-05-07T20:26:06.9461021Z 2025-05-07T20:26:06.9461024Z 2025-05-07T20:26:06.9461069Z 2025-05-07T20:26:06.9461074Z 2025-05-07T20:26:06.9463253Z 2025-05-07T20:26:06.9850099Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:26:07.0406483Z nsight-compute-2024. 
| 443.1 MB | #########3 | 94% 2025-05-07T20:26:07.0406763Z 2025-05-07T20:26:07.0406768Z 2025-05-07T20:26:07.0406771Z 2025-05-07T20:26:07.0406775Z 2025-05-07T20:26:07.0406788Z 2025-05-07T20:26:07.0406792Z 2025-05-07T20:26:07.0406795Z 2025-05-07T20:26:07.0407712Z 2025-05-07T20:26:07.0465456Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###2 | 32%  2025-05-07T20:26:07.0465908Z 2025-05-07T20:26:07.0465915Z 2025-05-07T20:26:07.0465920Z 2025-05-07T20:26:07.0465925Z 2025-05-07T20:26:07.0465931Z 2025-05-07T20:26:07.0465936Z 2025-05-07T20:26:07.0465941Z 2025-05-07T20:26:07.0465946Z 2025-05-07T20:26:07.0467649Z 2025-05-07T20:26:07.0971467Z libcurand-10.3.7.77 | 39.9 MB | 7 | 7%  2025-05-07T20:26:07.1412334Z nsight-compute-2024. | 443.1 MB | #########4 | 95% 2025-05-07T20:26:07.1412643Z 2025-05-07T20:26:07.1412647Z 2025-05-07T20:26:07.1412650Z 2025-05-07T20:26:07.1412654Z 2025-05-07T20:26:07.1412658Z 2025-05-07T20:26:07.1412661Z 2025-05-07T20:26:07.1412665Z 2025-05-07T20:26:07.1424272Z 2025-05-07T20:26:07.1491907Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###8 | 39%  2025-05-07T20:26:07.1492236Z 2025-05-07T20:26:07.1492240Z 2025-05-07T20:26:07.1492244Z 2025-05-07T20:26:07.1492248Z 2025-05-07T20:26:07.1492252Z 2025-05-07T20:26:07.1492255Z 2025-05-07T20:26:07.1492259Z 2025-05-07T20:26:07.1492263Z 2025-05-07T20:26:07.1494360Z 2025-05-07T20:26:07.2203197Z libcurand-10.3.7.77 | 39.9 MB | #4 | 14%  2025-05-07T20:26:07.2483310Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:26:07.2483644Z 2025-05-07T20:26:07.2483649Z 2025-05-07T20:26:07.2483652Z 2025-05-07T20:26:07.2483656Z 2025-05-07T20:26:07.2483660Z 2025-05-07T20:26:07.2483706Z 2025-05-07T20:26:07.2483710Z 2025-05-07T20:26:07.2484996Z 2025-05-07T20:26:07.2498045Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####4 | 45%  2025-05-07T20:26:07.2498666Z 2025-05-07T20:26:07.2498672Z 2025-05-07T20:26:07.2498678Z 2025-05-07T20:26:07.2498683Z 2025-05-07T20:26:07.2498689Z 2025-05-07T20:26:07.2498694Z 2025-05-07T20:26:07.2498699Z 2025-05-07T20:26:07.2498705Z 2025-05-07T20:26:07.2499967Z 2025-05-07T20:26:07.3361699Z libcurand-10.3.7.77 | 39.9 MB | ##1 | 22%  2025-05-07T20:26:07.3500084Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:26:07.3500390Z 2025-05-07T20:26:07.3500394Z 2025-05-07T20:26:07.3500397Z 2025-05-07T20:26:07.3500401Z 2025-05-07T20:26:07.3500405Z 2025-05-07T20:26:07.3500408Z 2025-05-07T20:26:07.3500412Z 2025-05-07T20:26:07.3500416Z 2025-05-07T20:26:07.3502287Z 2025-05-07T20:26:07.4254424Z libcurand-10.3.7.77 | 39.9 MB | ### | 30%  2025-05-07T20:26:07.4254935Z 2025-05-07T20:26:07.4254939Z 2025-05-07T20:26:07.4254943Z 2025-05-07T20:26:07.4254946Z 2025-05-07T20:26:07.4254950Z 2025-05-07T20:26:07.4254954Z 2025-05-07T20:26:07.4254957Z 2025-05-07T20:26:07.4256451Z 2025-05-07T20:26:07.4365983Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##### | 51%  2025-05-07T20:26:07.4505817Z nsight-compute-2024. 
| 443.1 MB | #########6 | 97% 2025-05-07T20:26:07.4506095Z 2025-05-07T20:26:07.4506102Z 2025-05-07T20:26:07.4506108Z 2025-05-07T20:26:07.4506115Z 2025-05-07T20:26:07.4506119Z 2025-05-07T20:26:07.4506122Z 2025-05-07T20:26:07.4506126Z 2025-05-07T20:26:07.4506130Z 2025-05-07T20:26:07.4508434Z 2025-05-07T20:26:07.5259160Z libcurand-10.3.7.77 | 39.9 MB | ###8 | 38%  2025-05-07T20:26:07.5259466Z 2025-05-07T20:26:07.5259469Z 2025-05-07T20:26:07.5259473Z 2025-05-07T20:26:07.5259477Z 2025-05-07T20:26:07.5259481Z 2025-05-07T20:26:07.5259485Z 2025-05-07T20:26:07.5259516Z 2025-05-07T20:26:07.5261699Z 2025-05-07T20:26:07.5366061Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####5 | 56%  2025-05-07T20:26:07.5602290Z nsight-compute-2024. | 443.1 MB | #########7 | 97% 2025-05-07T20:26:07.5602648Z 2025-05-07T20:26:07.5602653Z 2025-05-07T20:26:07.5602656Z 2025-05-07T20:26:07.5602661Z 2025-05-07T20:26:07.5602664Z 2025-05-07T20:26:07.5602668Z 2025-05-07T20:26:07.5602672Z 2025-05-07T20:26:07.5602676Z 2025-05-07T20:26:07.5608895Z 2025-05-07T20:26:07.6266433Z libcurand-10.3.7.77 | 39.9 MB | ####6 | 46%  2025-05-07T20:26:07.6266742Z 2025-05-07T20:26:07.6266745Z 2025-05-07T20:26:07.6266749Z 2025-05-07T20:26:07.6266752Z 2025-05-07T20:26:07.6266756Z 2025-05-07T20:26:07.6266760Z 2025-05-07T20:26:07.6266763Z 2025-05-07T20:26:07.6266767Z 2025-05-07T20:26:07.6480071Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######1 | 61%  2025-05-07T20:26:07.6610489Z nsight-compute-2024. | 443.1 MB | #########8 | 98% 2025-05-07T20:26:07.6610851Z 2025-05-07T20:26:07.6610861Z 2025-05-07T20:26:07.6611022Z 2025-05-07T20:26:07.6611052Z 2025-05-07T20:26:07.6611058Z 2025-05-07T20:26:07.6611063Z 2025-05-07T20:26:07.6611068Z 2025-05-07T20:26:07.6611107Z 2025-05-07T20:26:07.6611204Z 2025-05-07T20:26:07.7270642Z libcurand-10.3.7.77 | 39.9 MB | #####4 | 54%  2025-05-07T20:26:07.7270948Z 2025-05-07T20:26:07.7270951Z 2025-05-07T20:26:07.7270955Z 2025-05-07T20:26:07.7270959Z 2025-05-07T20:26:07.7270962Z 2025-05-07T20:26:07.7270966Z 2025-05-07T20:26:07.7270979Z 2025-05-07T20:26:07.7271785Z 2025-05-07T20:26:07.7484865Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######6 | 67%  2025-05-07T20:26:07.7686787Z nsight-compute-2024. | 443.1 MB | #########8 | 99% 2025-05-07T20:26:07.7687098Z 2025-05-07T20:26:07.7687290Z 2025-05-07T20:26:07.7687294Z 2025-05-07T20:26:07.7687298Z 2025-05-07T20:26:07.7687302Z 2025-05-07T20:26:07.7687305Z 2025-05-07T20:26:07.7687337Z 2025-05-07T20:26:07.7687369Z 2025-05-07T20:26:07.7687379Z 2025-05-07T20:26:07.8331370Z libcurand-10.3.7.77 | 39.9 MB | ######2 | 62%  2025-05-07T20:26:07.8331666Z 2025-05-07T20:26:07.8331670Z 2025-05-07T20:26:07.8331673Z 2025-05-07T20:26:07.8331689Z 2025-05-07T20:26:07.8331692Z 2025-05-07T20:26:07.8331697Z 2025-05-07T20:26:07.8331700Z 2025-05-07T20:26:07.8333090Z 2025-05-07T20:26:07.8678031Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######1 | 72%  2025-05-07T20:26:07.8836105Z nsight-compute-2024. 
| 443.1 MB | #########9 | 99% 2025-05-07T20:26:07.8836398Z 2025-05-07T20:26:07.8836404Z 2025-05-07T20:26:07.8836409Z 2025-05-07T20:26:07.8836414Z 2025-05-07T20:26:07.8836419Z 2025-05-07T20:26:07.8836439Z 2025-05-07T20:26:07.8836444Z 2025-05-07T20:26:07.8836449Z 2025-05-07T20:26:07.8836632Z 2025-05-07T20:26:07.9345002Z libcurand-10.3.7.77 | 39.9 MB | ######9 | 70%  2025-05-07T20:26:07.9345298Z 2025-05-07T20:26:07.9345704Z 2025-05-07T20:26:07.9345709Z 2025-05-07T20:26:07.9345713Z 2025-05-07T20:26:07.9345728Z 2025-05-07T20:26:07.9345732Z 2025-05-07T20:26:07.9345735Z 2025-05-07T20:26:07.9345743Z 2025-05-07T20:26:07.9764058Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######7 | 77%  2025-05-07T20:26:07.9841390Z nsight-compute-2024. | 443.1 MB | #########9 | 100% 2025-05-07T20:26:07.9841750Z 2025-05-07T20:26:07.9841756Z 2025-05-07T20:26:07.9841761Z 2025-05-07T20:26:07.9841766Z 2025-05-07T20:26:07.9841772Z 2025-05-07T20:26:07.9841778Z 2025-05-07T20:26:07.9841794Z 2025-05-07T20:26:07.9841799Z 2025-05-07T20:26:07.9841946Z 2025-05-07T20:26:08.0349294Z libcurand-10.3.7.77 | 39.9 MB | #######7 | 77%  2025-05-07T20:26:08.0349627Z 2025-05-07T20:26:08.0349640Z 2025-05-07T20:26:08.0349644Z 2025-05-07T20:26:08.0349648Z 2025-05-07T20:26:08.0349652Z 2025-05-07T20:26:08.0349658Z 2025-05-07T20:26:08.0349663Z 2025-05-07T20:26:08.0349667Z 2025-05-07T20:26:08.0881775Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########3 | 83%  2025-05-07T20:26:08.0882125Z 2025-05-07T20:26:08.0882129Z 2025-05-07T20:26:08.0882133Z 2025-05-07T20:26:08.0882136Z 2025-05-07T20:26:08.0882140Z 2025-05-07T20:26:08.0882144Z 2025-05-07T20:26:08.0882147Z 2025-05-07T20:26:08.0882151Z 2025-05-07T20:26:08.0886527Z 2025-05-07T20:26:08.1350590Z libcurand-10.3.7.77 | 39.9 MB | ########4 | 84%  2025-05-07T20:26:08.1350932Z 2025-05-07T20:26:08.1350936Z 2025-05-07T20:26:08.1350940Z 2025-05-07T20:26:08.1350944Z 2025-05-07T20:26:08.1350947Z 2025-05-07T20:26:08.1350951Z 2025-05-07T20:26:08.1350954Z 2025-05-07T20:26:08.1352438Z 2025-05-07T20:26:08.1908127Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######### | 90%  2025-05-07T20:26:08.1908469Z 2025-05-07T20:26:08.1908473Z 2025-05-07T20:26:08.1908477Z 2025-05-07T20:26:08.1908484Z 2025-05-07T20:26:08.1908489Z 2025-05-07T20:26:08.1908494Z 2025-05-07T20:26:08.1908527Z 2025-05-07T20:26:08.1908543Z 2025-05-07T20:26:08.1908547Z 2025-05-07T20:26:08.2356442Z libcurand-10.3.7.77 | 39.9 MB | #########1 | 92%  2025-05-07T20:26:08.2356780Z 2025-05-07T20:26:08.2356784Z 2025-05-07T20:26:08.2356788Z 2025-05-07T20:26:08.2356791Z 2025-05-07T20:26:08.2356795Z 2025-05-07T20:26:08.2356799Z 2025-05-07T20:26:08.2356811Z 2025-05-07T20:26:08.2356815Z 2025-05-07T20:26:08.2828347Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########6 | 96%  2025-05-07T20:26:08.2828803Z 2025-05-07T20:26:08.2828809Z 2025-05-07T20:26:08.2828815Z 2025-05-07T20:26:08.2828829Z 2025-05-07T20:26:08.2828835Z 2025-05-07T20:26:08.2828840Z 2025-05-07T20:26:08.2831313Z 2025-05-07T20:26:08.3514866Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%  2025-05-07T20:26:08.3515302Z 2025-05-07T20:26:08.3515307Z 2025-05-07T20:26:08.3515310Z 2025-05-07T20:26:08.3515315Z 2025-05-07T20:26:08.3515320Z 2025-05-07T20:26:08.3515324Z 2025-05-07T20:26:08.3515371Z 2025-05-07T20:26:08.3515375Z 2025-05-07T20:26:08.3515378Z 2025-05-07T20:26:08.3517392Z 2025-05-07T20:26:08.4464994Z gds-tools-1.11.1.6 | 37.8 MB | | 0%  2025-05-07T20:26:08.4465305Z 2025-05-07T20:26:08.4465309Z 2025-05-07T20:26:08.4468200Z 2025-05-07T20:26:08.4514581Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:26:08.4515064Z 
2025-05-07T20:26:09.5896451Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:26:09.9347202Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:26:10.3290220Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
2025-05-07T20:26:10.8799437Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:26:11.0806062Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:26:11.4581404Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:26:11.5586257Z python-3.13.0 | 31.5 MB | ########## | 100%
2025-05-07T20:26:12.0097824Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:26:12.2080079Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:26:12.2099306Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:26:12.2620725Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:26:12.5381737Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:26:13.1529068Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:26:14.1968817Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:26:14.3518001Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:26:14.9478397Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:26:15.7963771Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:26:17.3803692Z ... (more hidden) ...
2025-05-07T20:26:24.0497080Z 2025-05-07T20:26:24.0497083Z 2025-05-07T20:26:24.0497095Z 2025-05-07T20:26:24.0497099Z 2025-05-07T20:26:24.0497102Z 2025-05-07T20:26:24.0497211Z  2025-05-07T20:26:24.0497332Z 2025-05-07T20:26:24.0497336Z 2025-05-07T20:26:24.0497340Z 2025-05-07T20:26:24.0497343Z 2025-05-07T20:26:24.0497352Z 2025-05-07T20:26:24.0497355Z 2025-05-07T20:26:24.0497465Z  2025-05-07T20:26:24.0497590Z 2025-05-07T20:26:24.0497594Z 2025-05-07T20:26:24.0497597Z 2025-05-07T20:26:24.0497601Z 2025-05-07T20:26:24.0497604Z 2025-05-07T20:26:24.0497608Z 2025-05-07T20:26:24.0497618Z 2025-05-07T20:26:24.0497732Z  2025-05-07T20:26:24.0497871Z 2025-05-07T20:26:24.0497874Z 2025-05-07T20:26:24.0497878Z 2025-05-07T20:26:24.0498099Z 2025-05-07T20:26:24.0498103Z 2025-05-07T20:26:24.0498107Z 2025-05-07T20:26:24.0498110Z 2025-05-07T20:26:24.0498120Z 2025-05-07T20:26:24.0498659Z  2025-05-07T20:26:24.0498832Z 2025-05-07T20:26:24.0498835Z 2025-05-07T20:26:24.0498839Z 2025-05-07T20:26:24.0498842Z 2025-05-07T20:26:24.0498846Z 2025-05-07T20:26:24.0498849Z 2025-05-07T20:26:24.0498852Z 2025-05-07T20:26:24.0498856Z 2025-05-07T20:26:24.0498859Z 2025-05-07T20:26:24.0498988Z  2025-05-07T20:26:24.0499168Z 2025-05-07T20:26:24.0499172Z 2025-05-07T20:26:24.0499175Z 2025-05-07T20:26:24.0499179Z 2025-05-07T20:26:24.0499182Z 2025-05-07T20:26:24.0499185Z 2025-05-07T20:26:24.0499189Z 2025-05-07T20:26:24.0499192Z 2025-05-07T20:26:24.0499196Z 2025-05-07T20:26:24.0499199Z 2025-05-07T20:26:24.0499339Z  2025-05-07T20:26:24.0499519Z 2025-05-07T20:26:24.0499522Z 2025-05-07T20:26:24.0499526Z 2025-05-07T20:26:24.0499529Z 2025-05-07T20:26:24.0499540Z 2025-05-07T20:26:24.0499550Z 2025-05-07T20:26:24.0499553Z 2025-05-07T20:26:24.0499556Z 2025-05-07T20:26:24.0499560Z 2025-05-07T20:26:24.0499563Z 2025-05-07T20:26:24.0499567Z 2025-05-07T20:26:24.0499711Z  2025-05-07T20:26:24.0499900Z 2025-05-07T20:26:24.0499903Z 2025-05-07T20:26:24.0499907Z 2025-05-07T20:26:24.0499910Z 2025-05-07T20:26:24.0499913Z 2025-05-07T20:26:24.0499917Z 2025-05-07T20:26:24.0499920Z 2025-05-07T20:26:24.0499924Z 2025-05-07T20:26:24.0499927Z 2025-05-07T20:26:24.0499930Z 2025-05-07T20:26:24.0499942Z 2025-05-07T20:26:24.0499945Z 2025-05-07T20:26:24.0500081Z  2025-05-07T20:26:24.0500280Z 2025-05-07T20:26:24.0500284Z 2025-05-07T20:26:24.0500287Z 2025-05-07T20:26:24.0500290Z 2025-05-07T20:26:24.0500294Z 2025-05-07T20:26:24.0500304Z 2025-05-07T20:26:24.0500307Z 2025-05-07T20:26:24.0500311Z 2025-05-07T20:26:24.0500314Z 2025-05-07T20:26:24.0500318Z 2025-05-07T20:26:24.0500321Z 2025-05-07T20:26:24.0500328Z 2025-05-07T20:26:24.0500337Z 2025-05-07T20:26:24.0500477Z  2025-05-07T20:26:24.0500690Z 2025-05-07T20:26:24.0500693Z 2025-05-07T20:26:24.0500696Z 2025-05-07T20:26:24.0500700Z 2025-05-07T20:26:24.0500703Z 2025-05-07T20:26:24.0500707Z 2025-05-07T20:26:24.0500710Z 2025-05-07T20:26:24.0500713Z 2025-05-07T20:26:24.0500717Z 2025-05-07T20:26:24.0500720Z 2025-05-07T20:26:24.0500723Z 2025-05-07T20:26:24.0500727Z 2025-05-07T20:26:24.0500730Z 2025-05-07T20:26:24.0500734Z 2025-05-07T20:26:24.0500880Z  2025-05-07T20:26:24.0501101Z 2025-05-07T20:26:24.0501105Z 2025-05-07T20:26:24.0501108Z 2025-05-07T20:26:24.0501112Z 2025-05-07T20:26:24.0501115Z 2025-05-07T20:26:24.0501119Z 2025-05-07T20:26:24.0501122Z 2025-05-07T20:26:24.0501125Z 2025-05-07T20:26:24.0501129Z 2025-05-07T20:26:24.0501132Z 2025-05-07T20:26:24.0501136Z 2025-05-07T20:26:24.0501139Z 2025-05-07T20:26:24.0501143Z 2025-05-07T20:26:24.0501146Z 2025-05-07T20:26:24.0501160Z 2025-05-07T20:26:24.0501345Z  2025-05-07T20:26:24.0501585Z 
2025-05-07T20:26:24.0501589Z 2025-05-07T20:26:24.0501592Z 2025-05-07T20:26:24.0501596Z 2025-05-07T20:26:24.0501599Z 2025-05-07T20:26:24.0501603Z 2025-05-07T20:26:24.0501606Z 2025-05-07T20:26:24.0501609Z 2025-05-07T20:26:24.0501613Z 2025-05-07T20:26:24.0501623Z 2025-05-07T20:26:24.0501627Z 2025-05-07T20:26:24.0501630Z 2025-05-07T20:26:24.0501634Z 2025-05-07T20:26:24.0501637Z 2025-05-07T20:26:24.0501641Z 2025-05-07T20:26:24.0501644Z 2025-05-07T20:26:24.0501804Z  2025-05-07T20:26:24.0502035Z 2025-05-07T20:26:24.0502038Z 2025-05-07T20:26:24.0502042Z 2025-05-07T20:26:24.0502046Z 2025-05-07T20:26:24.0502049Z 2025-05-07T20:26:24.0502053Z 2025-05-07T20:26:24.0502056Z 2025-05-07T20:26:24.0502060Z 2025-05-07T20:26:24.0502063Z 2025-05-07T20:26:24.0502067Z 2025-05-07T20:26:24.0502070Z 2025-05-07T20:26:24.0502074Z 2025-05-07T20:26:24.0502335Z 2025-05-07T20:26:24.0502340Z 2025-05-07T20:26:24.0502343Z 2025-05-07T20:26:24.0502347Z 2025-05-07T20:26:24.0502351Z 2025-05-07T20:26:24.0502520Z  2025-05-07T20:26:24.0502723Z 2025-05-07T20:26:24.0502727Z 2025-05-07T20:26:24.0502731Z 2025-05-07T20:26:24.0502734Z 2025-05-07T20:26:24.0502738Z 2025-05-07T20:26:24.0502741Z 2025-05-07T20:26:24.0502745Z 2025-05-07T20:26:24.0502748Z 2025-05-07T20:26:24.0502752Z 2025-05-07T20:26:24.0502756Z 2025-05-07T20:26:24.0502759Z 2025-05-07T20:26:24.0502763Z 2025-05-07T20:26:24.0502766Z 2025-05-07T20:26:24.0502770Z 2025-05-07T20:26:24.0502781Z 2025-05-07T20:26:24.0502785Z 2025-05-07T20:26:24.0502788Z 2025-05-07T20:26:24.0502792Z 2025-05-07T20:26:24.0502953Z  2025-05-07T20:26:24.0503155Z 2025-05-07T20:26:24.0503159Z 2025-05-07T20:26:24.0503266Z  2025-05-07T20:26:24.0503368Z 2025-05-07T20:26:24.0503372Z 2025-05-07T20:26:24.0503486Z  2025-05-07T20:26:24.0503596Z 2025-05-07T20:26:24.0503600Z 2025-05-07T20:26:24.0503603Z 2025-05-07T20:26:24.0503704Z  2025-05-07T20:26:24.0503818Z 2025-05-07T20:26:24.0503821Z 2025-05-07T20:26:24.0503825Z 2025-05-07T20:26:24.0503829Z 2025-05-07T20:26:24.0503934Z  2025-05-07T20:26:24.0504055Z 2025-05-07T20:26:24.0504058Z 2025-05-07T20:26:24.0504062Z 2025-05-07T20:26:24.0504066Z 2025-05-07T20:26:24.0504069Z 2025-05-07T20:26:24.0504178Z  2025-05-07T20:26:24.0504297Z 2025-05-07T20:26:24.0504301Z 2025-05-07T20:26:24.0504310Z 2025-05-07T20:26:24.0504314Z 2025-05-07T20:26:24.0504318Z 2025-05-07T20:26:24.0504321Z 2025-05-07T20:26:24.0504434Z  2025-05-07T20:26:24.0504559Z 2025-05-07T20:26:24.0504562Z 2025-05-07T20:26:24.0504566Z 2025-05-07T20:26:24.0504570Z 2025-05-07T20:26:24.0504580Z 2025-05-07T20:26:24.0504583Z 2025-05-07T20:26:24.0504587Z 2025-05-07T20:26:24.0504699Z  2025-05-07T20:26:24.0504834Z 2025-05-07T20:26:24.0504848Z 2025-05-07T20:26:24.0504851Z 2025-05-07T20:26:24.0504855Z 2025-05-07T20:26:24.0504858Z 2025-05-07T20:26:24.0504868Z 2025-05-07T20:26:24.0504872Z 2025-05-07T20:26:24.0504875Z 2025-05-07T20:26:24.0504993Z  2025-05-07T20:26:24.0505137Z 2025-05-07T20:26:24.0505141Z 2025-05-07T20:26:24.0505144Z 2025-05-07T20:26:24.0505148Z 2025-05-07T20:26:24.0505162Z 2025-05-07T20:26:24.0505165Z 2025-05-07T20:26:24.0505169Z 2025-05-07T20:26:24.0505173Z 2025-05-07T20:26:24.0505176Z 2025-05-07T20:26:24.0505300Z  2025-05-07T20:26:24.0505449Z 2025-05-07T20:26:24.0505452Z 2025-05-07T20:26:24.0505456Z 2025-05-07T20:26:24.0505466Z 2025-05-07T20:26:24.0505470Z 2025-05-07T20:26:24.0505473Z 2025-05-07T20:26:24.0505477Z 2025-05-07T20:26:24.0505481Z 2025-05-07T20:26:24.0505484Z 2025-05-07T20:26:24.0505488Z 2025-05-07T20:26:24.0505613Z  2025-05-07T20:26:24.0505770Z 2025-05-07T20:26:24.0505780Z 2025-05-07T20:26:24.0505792Z 
2025-05-07T20:26:24.3553116Z Preparing transaction: done
2025-05-07T20:26:30.8871528Z Verifying transaction: done
2025-05-07T20:26:31.8070720Z Executing transaction: done
2025-05-07T20:26:34.4210787Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:34.4211186Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:34.4212247Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:34.4226431Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:34.4239067Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:34.4243952Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:34.5828655Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:34.5851598Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
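The stubs directory appended in the commands that follow holds link-time driver stubs (libcuda.so, libnvidia-ml.so) shipped with the CUDA toolkit, so the build can link against the driver API without a real driver installed. A minimal sketch of the mechanism, assuming the env name build_binary used throughout this job; `conda env config vars set` persists the variable inside the env so every later activation or `conda run` sees it:

  # persist the stub path into the env (takes effect on the next activation)
  conda env config vars set -n build_binary \
      LD_LIBRARY_PATH="/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs"
  # confirm the variable is now baked into the env
  conda run -n build_binary printenv LD_LIBRARY_PATH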
2025-05-07T20:26:34.6226517Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:36.5234066Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:36.5888611Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:37.0171148Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:37.0522798Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:37.4926030Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:37.4927329Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:39.9670449Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:42.0022576Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:44.0453751Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:44.0454548Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:46.0840753Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:48.0003345Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:48.0642407Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:51.9629159Z /tmp/tmp6ugsjzsc: line 3: clang: command not found
2025-05-07T20:26:51.9630378Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:52.0328275Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:52.0350880Z total 36
2025-05-07T20:26:52.0351171Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 .
2025-05-07T20:26:52.0351599Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:25 ..
2025-05-07T20:26:52.0352040Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:26:52.0352538Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:26:52.0353007Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:26:52.0353457Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:52.0353890Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:52.0354331Z -rw-r--r--. 2 ec2-user ec2-user 2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:26:52.0354820Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
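The ~cuda-nvcc_activate.sh listed above is the hook in question: it pins nvcc to the conda-packaged host compiler via a -ccbin= flag, and the sed in the next command deletes that line so nvcc resolves the host c++ from PATH instead. A sketch of the kind of line being removed (illustrative contents, not the actual script shipped with the cuda-nvcc package):

  # hypothetical excerpt of ~cuda-nvcc_activate.sh; only the -ccbin= line matters here
  export NVCC_PREPEND_FLAGS="${NVCC_PREPEND_FLAGS:-} -ccbin=${CXX}"

With the pin gone, the job instead sets NVCC_PREPEND_FLAGS to -allow-unsupported-compiler below, which tells nvcc to proceed even when the host compiler version falls outside the toolkit's officially supported range.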
2025-05-07T20:26:52.0355442Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh 2025-05-07T20:26:52.0355861Z 2025-05-07T20:26:52.0374483Z 2025-05-07T20:26:52.0375057Z + conda run -n build_binary c++ --version | grep -i clang 2025-05-07T20:26:52.0375336Z 2025-05-07T20:26:54.0335900Z 2025-05-07T20:26:54.0336692Z [BUILD] Setting prepend flags for NVCC ... 2025-05-07T20:26:54.0337393Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler" 2025-05-07T20:26:54.0337765Z 2025-05-07T20:26:54.4759990Z 2025-05-07T20:26:54.4760453Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS 2025-05-07T20:26:54.4760715Z 2025-05-07T20:26:56.3758971Z -allow-unsupported-compiler 2025-05-07T20:26:56.3759188Z 2025-05-07T20:26:56.4405877Z 2025-05-07T20:26:56.4406543Z [INFO] Printing out all preprocessor defines in nvcc ... 2025-05-07T20:26:56.4407221Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:58.4135422Z 2025-05-07T20:26:58.4136160Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:58.4136894Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:58.4137239Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:58.4137560Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:58.4137881Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:58.4138166Z #define _STL_PAIR_H 1 2025-05-07T20:26:58.4138467Z #define __cpp_attributes 200809L 2025-05-07T20:26:58.4138935Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:58.4139383Z #define __DELETE_THROW throw() 2025-05-07T20:26:58.4139640Z #define _PTRDIFF_T_ 2025-05-07T20:26:58.4139878Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:58.4140248Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:58.4140597Z #define _IO_LEFT 02 2025-05-07T20:26:58.4140850Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:58.4141196Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:58.4141934Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:58.4144180Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:58.4144596Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:58.4144868Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:58.4145113Z #define _IOS_OUTPUT 2 2025-05-07T20:26:58.4145470Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:58.4145975Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:58.4146386Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:58.4146704Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:58.4146970Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:58.4147717Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:58.4148665Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:58.4149216Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:58.4149511Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:58.4149812Z #define _T_WCHAR_ 2025-05-07T20:26:58.4150031Z #define stdout stdout 2025-05-07T20:26:58.4150347Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:58.4150720Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:58.4150967Z #define __flexarr [] 
2025-05-07T20:26:58.4151192Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:58.4151503Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:58.4151840Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:58.4152079Z #define _MATH_H 1 2025-05-07T20:26:58.4152350Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:58.4152683Z #define __S64_TYPE long int 2025-05-07T20:26:58.4152923Z #define __stub_fchflags 2025-05-07T20:26:58.4153185Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:58.4153468Z #define __SQUAD_TYPE long int 2025-05-07T20:26:58.4153727Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:58.4153987Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:58.4154240Z #define NL_NMAX INT_MAX 2025-05-07T20:26:58.4154474Z #define _BITS_TIME_H 1 2025-05-07T20:26:58.4154734Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:58.4155056Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:58.4155353Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:58.4155692Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:58.4156082Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:58.4156441Z #define __CHAR_BIT__ 8 2025-05-07T20:26:58.4156690Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.4157000Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:58.4157288Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:58.4157545Z #define FP_NAN 0 2025-05-07T20:26:58.4157804Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:58.4158238Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:58.4158723Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:58.4159098Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:58.4159381Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:58.4159638Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:58.4159883Z #define __SM_80_RT_H__ 2025-05-07T20:26:58.4160107Z #define _NEW 2025-05-07T20:26:58.4160326Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:58.4160593Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:58.4160949Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:58.4161355Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:58.4161625Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:58.4161860Z #define __USE_ANSI 1 2025-05-07T20:26:58.4162137Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:58.4162518Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:58.4162976Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:58.4163352Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:58.4163624Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:58.4163892Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:58.4164163Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:58.4164439Z #define PIPE_BUF 4096 2025-05-07T20:26:58.4164743Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:58.4165093Z #define ADJ_TICK 0x4000 2025-05-07T20:26:58.4165362Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:58.4165666Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:58.4165922Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:58.4166234Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:58.4166687Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:58.4167195Z #define 
cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:58.4167555Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:58.4167820Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:58.4168083Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:58.4168363Z #define __cpp_static_assert 201411L 2025-05-07T20:26:58.4168690Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:58.4169021Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:58.4169300Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:58.4169577Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:58.4169871Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:58.4170144Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:58.4170439Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.4170791Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:58.4171121Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:58.4171397Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:58.4171705Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.4172055Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:58.4172409Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:58.4172703Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:58.4172986Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:58.4173307Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:58.4173822Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:58.4174263Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:58.4174665Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:58.4174967Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:58.4175231Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:58.4175506Z #define __GCC_IEC_559 2 2025-05-07T20:26:58.4175790Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:58.4176124Z #define _IO_flockfile(_fp) 2025-05-07T20:26:58.4176374Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:58.4176636Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:58.4176893Z #define _IOFBF 0 2025-05-07T20:26:58.4177110Z #define __USE_BSD 1 2025-05-07T20:26:58.4177332Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:58.4177596Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:58.4177860Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:58.4178108Z #define _IO_NO_WRITES 8 2025-05-07T20:26:58.4178364Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:58.4178705Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:58.4179050Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:58.4179350Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:58.4179671Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:58.4179949Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:58.4180206Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:58.4180470Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:58.4180771Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:58.4181150Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:58.4181610Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:58.4181992Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:58.4182294Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:58.4182615Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:58.4182910Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:58.4183207Z #define 
_POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:58.4183499Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:58.4183762Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:58.4184327Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:58.4184897Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:58.4185215Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:58.4185527Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:58.4185822Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:58.4186099Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:58.4186361Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:58.4186656Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:58.4186979Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:58.4187272Z #define RAND_MAX 2147483647 2025-05-07T20:26:58.4187525Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:58.4250688Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.4251136Z #define __SM_90_RT_H__ 2025-05-07T20:26:58.4251387Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:58.4251641Z #define __COMPAR_FN_T 2025-05-07T20:26:58.4251888Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:58.4252159Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:58.4252623Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:58.4253137Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:58.4253479Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:58.4254009Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:58.4254314Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:58.4254655Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:58.4254970Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:58.4255472Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:58.4256006Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:58.4256339Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:58.4256603Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:58.4256899Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:58.4257201Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:58.4257460Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:58.4257728Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:58.4257988Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:58.4258230Z #define __u_char_defined 2025-05-07T20:26:58.4258545Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:58.4258904Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:58.4259167Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:58.4259413Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:58.4259694Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:58.4260128Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:58.4260555Z #define FP_INFINITE 1 2025-05-07T20:26:58.4260952Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:58.4261364Z #define _IO_pid_t __pid_t 2025-05-07T20:26:58.4261607Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:58.4261871Z #define __LEAF , __leaf__ 2025-05-07T20:26:58.4262112Z #define PATH_MAX 4096 2025-05-07T20:26:58.4262354Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:58.4262685Z #define 
__LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:58.4263007Z #define _LIMITS_H___ 2025-05-07T20:26:58.4263223Z #define __size_t 2025-05-07T20:26:58.4263450Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:58.4264481Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:58.4265040Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:58.4265339Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:58.4265667Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:58.4265926Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:58.4266270Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:58.4266668Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:58.4266964Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:58.4267283Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:58.4267565Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:58.4267845Z #define __INT8_C(c) c 2025-05-07T20:26:58.4268103Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:58.4268395Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:58.4268660Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:58.4268931Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:58.4269173Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:58.4269456Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:58.4269764Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.4270086Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:58.4270352Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:58.4270616Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:58.4270871Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:58.4271183Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:58.4271481Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:58.4271833Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:58.4272205Z #define NFDBITS __NFDBITS 2025-05-07T20:26:58.4272460Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:58.4272743Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:58.4273060Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:58.4273381Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:58.4273635Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:58.4273917Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:58.4274219Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:58.4274519Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:58.4274926Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:58.4275281Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:58.4275564Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:58.4275867Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:58.4276229Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:58.4276564Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:58.4276867Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:58.4277195Z #define __daddr_t_defined 2025-05-07T20:26:58.4277442Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:58.4277705Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:58.4278027Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:58.4278529Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:58.4279013Z #define _ACRTIMP 2025-05-07T20:26:58.4279229Z #define 
_IO_EOF_SEEN 0x10 2025-05-07T20:26:58.4279490Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:58.4279774Z #define _IOS_BIN 128 2025-05-07T20:26:58.4280111Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:58.4280518Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.4280782Z #define UNDERFLOW 4 2025-05-07T20:26:58.4280991Z #define NAME_MAX 255 2025-05-07T20:26:58.4281225Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:58.4281490Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:58.4281759Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:58.4282050Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:58.4282526Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:58.4282976Z #define __ptr_t void * 2025-05-07T20:26:58.4283210Z #define M_E 2.7182818284590452354 2025-05-07T20:26:58.4283483Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:58.4283744Z #define __USE_ISOCXX11 1 2025-05-07T20:26:58.4284000Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:58.4284312Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:58.4284601Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:58.4284866Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:58.4285146Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:58.4285456Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:58.4285708Z #define __linux 1 2025-05-07T20:26:58.4285938Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:58.4286210Z #define cudaDeviceMask 0xff 2025-05-07T20:26:58.4286468Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:58.4286758Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:58.4287032Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:58.4287318Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:58.4287626Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:58.4287926Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:58.4288218Z #define _BITS_TYPES_H 1 2025-05-07T20:26:58.4288494Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:58.4288829Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:58.4289126Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:58.4289395Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:58.4289678Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:58.4289963Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:58.4290766Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:58.4291570Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:58.4291849Z #define __unix 1 2025-05-07T20:26:58.4292070Z #define MATH_ERRNO 1 2025-05-07T20:26:58.4292306Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:58.4292580Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:58.4292846Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:58.4293122Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:58.4293404Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:58.4293834Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:58.4294301Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:58.4294765Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:58.4295059Z #define CUDARTAPI_CDECL 2025-05-07T20:26:58.4295304Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:58.4295574Z #define _GLIBCXX98_USE_C99_COMPLEX 
1 2025-05-07T20:26:58.4295856Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:58.4296109Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:58.4296346Z #define __SIZE_T 2025-05-07T20:26:58.4296595Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:58.4296922Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:26:58.4297210Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:58.4297468Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:58.4297726Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:58.4298101Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:58.4299689Z #define __WAIT_STATUS void * 2025-05-07T20:26:58.4299953Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:58.4300210Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:58.4300476Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:58.4300792Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:58.4301074Z #define __WINT_MIN__ 0U 2025-05-07T20:26:58.4301641Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:58.4302274Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:58.4302573Z #define WUNTRACED 2 2025-05-07T20:26:58.4303189Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:58.4303465Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:58.4303745Z #define NZERO 20 2025-05-07T20:26:58.4303967Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:58.4304246Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:58.4304533Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:58.4304808Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:58.4305066Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:58.4305347Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:58.4305610Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:58.4305885Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:58.4306156Z #define EXIT_FAILURE 1 2025-05-07T20:26:58.4306393Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:58.4306648Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:58.4306912Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:58.4307160Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:58.4307429Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:58.4307771Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:58.4308124Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:58.4308403Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:58.4308652Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:58.4308920Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:58.4309202Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:58.4309501Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:58.4309782Z #define SEEK_DATA 3 2025-05-07T20:26:58.4310005Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:58.4310298Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:58.4310707Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:58.4311090Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:58.4311329Z #define __INT64_C(c) c ## L 2025-05-07T20:26:58.4311597Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:58.4311942Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:58.4312274Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:58.4312538Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:58.4312833Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:58.4313131Z #define STA_PPSWANDER 
0x0400 2025-05-07T20:26:58.4313378Z #define __INT_WCHAR_T_H 2025-05-07T20:26:58.4313617Z #define WSTOPPED 2 2025-05-07T20:26:58.4313855Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:58.4314131Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:58.4314379Z #define FP_NORMAL 4 2025-05-07T20:26:58.4314618Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:58.4314884Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:58.4315113Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:58.4315363Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:58.4315640Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:58.4315902Z #define cudaTextureType1D 0x01 2025-05-07T20:26:58.4316164Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:58.4316417Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:58.4316681Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:58.4316974Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:58.4317387Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:58.4317820Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:58.4318079Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:58.4318332Z #define _POSIX_SOURCE 1 2025-05-07T20:26:58.4318570Z #define cudaTextureType2D 0x02 2025-05-07T20:26:58.4318824Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:58.4319085Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:58.4319382Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:58.4319641Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:58.4319954Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:58.4320279Z #define cudaTextureType3D 0x03 2025-05-07T20:26:58.4320536Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:58.4320788Z #define CLOCK_REALTIME 0 2025-05-07T20:26:58.4321159Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:58.4321535Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:58.4321836Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:58.4322105Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:58.4322369Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:58.4322647Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:58.4322908Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:58.4323201Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:58.4323487Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:58.4323761Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:58.4324006Z #define __GLIBC__ 2 2025-05-07T20:26:58.4324210Z #define __END_DECLS } 2025-05-07T20:26:58.4324445Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:58.4324795Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:58.4325156Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:58.4325398Z #define WCONTINUED 8 2025-05-07T20:26:58.4325625Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:58.4325884Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:58.4326149Z #define _ALLOCA_H 1 2025-05-07T20:26:58.4326373Z #define __host__ __location__(host) 2025-05-07T20:26:58.4326773Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:58.4327196Z #define __SLONG32_TYPE int 2025-05-07T20:26:58.4327451Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:58.4327723Z #define _SYS_SELECT_H 1 2025-05-07T20:26:58.4327955Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:58.4328203Z #define _IOS_NOCREATE 32 2025-05-07T20:26:58.4328451Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:58.4328719Z #define __cudaGet_warpSize() warpSize 
2025-05-07T20:26:58.4329004Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:58.4329283Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:58.4329554Z #define __global__ __location__(global) 2025-05-07T20:26:58.4329832Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:58.4330081Z #define __cpp_decltype_auto 201304L 2025-05-07T20:26:58.4330354Z #define __DBL_DIG__ 15 2025-05-07T20:26:58.4330575Z #define TIME_UTC 1 2025-05-07T20:26:58.4330788Z #define __FLT32_DIG__ 6 2025-05-07T20:26:58.4331095Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:58.4331479Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:58.4331787Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:58.4332081Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:58.4332369Z #define _G_BUFSIZ 8192 2025-05-07T20:26:58.4332660Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:58.4333016Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:58.4333297Z #define __cudaCDP2GetDevice 2025-05-07T20:26:58.4333568Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:58.4334015Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:58.4334276Z #define __GXX_WEAK__ 1 2025-05-07T20:26:58.4334549Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.4334883Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:58.4335167Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:58.4335488Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:58.4335866Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:58.4336163Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:58.4336472Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:58.4336802Z #define _G_config_h 1 2025-05-07T20:26:58.4337093Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:58.4337465Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:58.4337766Z #define _GCC_WCHAR_T 2025-05-07T20:26:58.4338016Z #define TMP_MAX 238328 2025-05-07T20:26:58.4338264Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:58.4338552Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:58.4338829Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.4339126Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:58.4339429Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:58.4339744Z #define _IO_SKIPWS 01 2025-05-07T20:26:58.4340324Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:58.4340939Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:58.4341232Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:58.4341594Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:58.4342003Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:58.4342411Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:58.4342818Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:58.4343088Z #define le32toh(x) (x) 2025-05-07T20:26:58.4343337Z #define _SIZE_T_DEFINED 2025-05-07T20:26:58.4343610Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:58.4343983Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:58.4344382Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:58.4344833Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:58.4345302Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:58.4345600Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:58.4345890Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:58.4346171Z #define _POSIX_NAME_MAX 14 
2025-05-07T20:26:58.4346471Z [predefined-macro dump truncated: a long series of #define lines emitted by the preprocessor, covering glibc, libstdc++, and CUDA runtime headers. Recoverable toolchain details: GCC 11.4.0 (__GNUC__ 11, __VERSION__ "11.4.0"), libstdc++ __GLIBCXX__ 20230528, C++17 mode (__cplusplus 201703L), nvcc 12.6.85 (__CUDACC_VER_MAJOR__ 12, __CUDACC_VER_MINOR__ 6, __CUDACC_VER_BUILD__ 85), __CUDA_ARCH__ 520, _POSIX_C_SOURCE 200809L, target x86_64 Linux LP64 (__x86_64__ 1, __linux__ 1, __LP64__ 1).]
201511L 2025-05-07T20:26:58.4596002Z #define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:58.4596100Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:58.4596193Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:58.4596291Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:58.4596383Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:58.4596508Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.4596615Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:58.4596732Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:58.4596826Z #define __clock_t_defined 1 2025-05-07T20:26:58.4596921Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:58.4597028Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:58.4597124Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:58.4597217Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:58.4597312Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:58.4597425Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:58.4597513Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:58.4597679Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:58.4597762Z #define __SSE__ 1 2025-05-07T20:26:58.4597856Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:58.4597957Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:58.4598038Z #define _CTYPE_H 1 2025-05-07T20:26:58.4598127Z #define __sigset_t_defined 2025-05-07T20:26:58.4598564Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:58.4598707Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:58.4598824Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:58.4598940Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:58.4599033Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:58.4599113Z #define __SM_70_RT_H__ 2025-05-07T20:26:58.4599573Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:58.4599676Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:58.4599772Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:58.4599943Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:58.4600034Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:58.4600144Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:58.4600236Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:58.4600322Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:58.4600415Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:58.4600517Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:58.4600775Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:58.4600875Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:58.4600957Z #define EOF (-1) 2025-05-07T20:26:58.4601049Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:58.4601147Z #define __USE_POSIX199309 1 2025-05-07T20:26:58.4601244Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:58.4601347Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:58.4601439Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:58.4601534Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:58.4601648Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:58.4601739Z #define ____mbstate_t_defined 1 2025-05-07T20:26:58.4601841Z #define STA_NANO 0x2000 2025-05-07T20:26:58.4601940Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:58.4602031Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:58.4602114Z #define _IO_LINKED 0x80 2025-05-07T20:26:58.4602378Z #define __cpp_lib_launder 201606 2025-05-07T20:26:58.4602502Z #define __SIZEOF_INT128__ 16 2025-05-07T20:26:58.4608372Z #define 
__PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:58.4608493Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:58.4608604Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:58.4608750Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:58.4608872Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:58.4608992Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:58.4609090Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:58.4609192Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:58.4609284Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:58.4609415Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:58.4609547Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.4609749Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:58.4609933Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:58.4610028Z #define __stub_stty 2025-05-07T20:26:58.4610194Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:58.4610289Z #define le16toh(x) (x) 2025-05-07T20:26:58.4610398Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:58.4610570Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:58.4610663Z #define _SIZET_ 2025-05-07T20:26:58.4610761Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:58.4610854Z #define _SVID_SOURCE 1 2025-05-07T20:26:58.4610944Z #define _LP64 1 2025-05-07T20:26:58.4611034Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:58.4611265Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:58.4611382Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:58.4611469Z #define __UINT8_C(c) c 2025-05-07T20:26:58.4611565Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:58.4611667Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:58.4611779Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:58.4611881Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:58.4611975Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:58.4612075Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:58.4612173Z #define CUDARTAPI 2025-05-07T20:26:58.4612257Z #define IOV_MAX 1024 2025-05-07T20:26:58.4612403Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:58.4612637Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:58.4612828Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:58.4612912Z #define __wchar_t__ 2025-05-07T20:26:58.4613022Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:58.4613106Z #define SEEK_END 2 2025-05-07T20:26:58.4613199Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:58.4613376Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:26:58.4613474Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:58.4613750Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:58.4613842Z #define ____FILE_defined 1 2025-05-07T20:26:58.4613960Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:58.4614062Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:58.4614150Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:58.4614245Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:58.4614497Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:58.4614631Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:58.4614720Z #define _IO_RIGHT 04 2025-05-07T20:26:58.4614821Z #define __END_NAMESPACE_STD 2025-05-07T20:26:58.4615005Z #define __FLT128_NORM_MAX__ 
1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:58.4615104Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:58.4615224Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:58.4615321Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:58.4615428Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:58.4615513Z #define _STDDEF_H_ 2025-05-07T20:26:58.4615596Z #define __amd64__ 1 2025-05-07T20:26:58.4615773Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:58.4615872Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.4615990Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:58.4616193Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:58.4616305Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.4616458Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:58.4616589Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:58.4616695Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:58.4616813Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:58.4616914Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:58.4617027Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:58.4617131Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:58.4617228Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:58.4617324Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:58.4617502Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:58.4617596Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:58.4617778Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:58.4617877Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:58.4617974Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:58.4618130Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:58.4618235Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:58.4618329Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:58.4618438Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:58.4618535Z #define P_tmpdir "/tmp" 2025-05-07T20:26:58.4618654Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:58.4618755Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:58.4618857Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:58.4619021Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:58.4619197Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:58.4619298Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:58.4619426Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:58.4619539Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:58.4619641Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:58.4619872Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:58.4620138Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:58.4620254Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:58.4620357Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:58.4620448Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:58.4620543Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:58.4620647Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:58.4620744Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:58.4620837Z #define __FXSR__ 1 2025-05-07T20:26:58.4620919Z #define _SIZE_T 2025-05-07T20:26:58.4621022Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:58.4621134Z #define cudaHostRegisterReadOnly 0x08 
2025-05-07T20:26:58.4621307Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:58.4621454Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:26:58.4621551Z #define _IO_ssize_t __ssize_t 2025-05-07T20:26:58.4621650Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:26:58.4621839Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:58.4622045Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:26:58.4622133Z #define _GXX_NULLPTR_T 2025-05-07T20:26:58.4622255Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:26:58.4622346Z #define FOPEN_MAX 16 2025-05-07T20:26:58.4622434Z #define __BIG_ENDIAN 4321 2025-05-07T20:26:58.4622551Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:58.4622651Z #define __suseconds_t_defined 2025-05-07T20:26:58.4622738Z #define __off_t_defined 2025-05-07T20:26:58.4622831Z #define stderr stderr 2025-05-07T20:26:58.4622926Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:26:58.4623039Z #define __glibcxx_requires_string(_String) 2025-05-07T20:26:58.4623142Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:26:58.4623234Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:26:58.4623640Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:26:58.4623741Z #define __mode_t_defined 2025-05-07T20:26:58.4623825Z #define _GCC_SIZE_T 2025-05-07T20:26:58.4623923Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.4624035Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:26:58.4624140Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:26:58.4624235Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:26:58.4624335Z #define __UINT32_C(c) c ## U 2025-05-07T20:26:58.4624443Z #define __cpp_alias_templates 200704L 2025-05-07T20:26:58.4624555Z #define cudaHostAllocMapped 0x02 2025-05-07T20:26:58.4624662Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:26:58.4624753Z #define _STL_ITERATOR_H 1 2025-05-07T20:26:58.4624841Z #define __size_t__ 2025-05-07T20:26:58.4624970Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:26:58.4625065Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:26:58.4625181Z #define cudaEventRecordExternal 0x01 2025-05-07T20:26:58.4625334Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:26:58.4625433Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:26:58.4625605Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:26:58.4625691Z #define _ENDIAN_H 1 2025-05-07T20:26:58.4625802Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:26:58.4625899Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:26:58.4626000Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:26:58.4626087Z #define __try try 2025-05-07T20:26:58.4626186Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:26:58.4626281Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:26:58.4626379Z #define __INT8_MAX__ 0x7f 2025-05-07T20:26:58.4626631Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:26:58.4626720Z #define __LONG_WIDTH__ 64 2025-05-07T20:26:58.4626807Z #define __PIC__ 2 2025-05-07T20:26:58.4626919Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:26:58.4627038Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:26:58.4627284Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:26:58.4627486Z #define _GLIBCXX_HAVE_FLOAT_H 1 
2025-05-07T20:26:58.4627584Z #define _GLIBCXX_HAVE_ATANL 1 2025-05-07T20:26:58.4627764Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:58.4627863Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:26:58.4627968Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:26:58.4628056Z #define _IO_uid_t __uid_t 2025-05-07T20:26:58.4628154Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:26:58.4628288Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:26:58.4628379Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:26:58.4628523Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:58.4628630Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:26:58.4628749Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:26:58.4628837Z #define LONG_BIT 64 2025-05-07T20:26:58.4628943Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:26:58.4629053Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:26:58.4629191Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:26:58.4629286Z #define __fsfilcnt_t_defined 2025-05-07T20:26:58.4629376Z #define __blkcnt_t_defined 2025-05-07T20:26:58.4629648Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:58.4629740Z #define __USE_LARGEFILE 1 2025-05-07T20:26:58.4629839Z #define __cpp_constexpr 201603L 2025-05-07T20:26:58.4629939Z #define CUDART_VERSION 12060 2025-05-07T20:26:58.4630028Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:26:58.4630129Z #define cudaDeviceMapHost 0x08 2025-05-07T20:26:58.4630227Z #define _GLIBCXX_CMATH 1 2025-05-07T20:26:58.4630419Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:26:58.4630516Z #define __lldiv_t_defined 1 2025-05-07T20:26:58.4630598Z #define __SSE2__ 1 2025-05-07T20:26:58.4630680Z #define _IOLBF 1 2025-05-07T20:26:58.4630787Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:26:58.4630894Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:26:58.4631002Z #define __cpp_deduction_guides 201703L 2025-05-07T20:26:58.4631104Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:26:58.4631212Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:26:58.4631300Z #define __INT32_TYPE__ int 2025-05-07T20:26:58.4631395Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:26:58.4631502Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:26:58.4631606Z #define __cpp_exceptions 199711L 2025-05-07T20:26:58.4631701Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:26:58.4631813Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:26:58.4631911Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:26:58.4632026Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:26:58.4632184Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:26:58.4632283Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:26:58.4632375Z #define __SWORD_TYPE long int 2025-05-07T20:26:58.4632468Z #define __INTMAX_TYPE__ long int 2025-05-07T20:26:58.4632576Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:26:58.4632671Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:26:58.4632764Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:26:58.4633047Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:58.4633138Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:26:58.4633288Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:26:58.4633369Z #define _T_SIZE 2025-05-07T20:26:58.4633473Z #define cudaHostAllocDefault 0x00 2025-05-07T20:26:58.4633601Z #define 
_PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 2025-05-07T20:26:58.4633723Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:26:58.4633814Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:26:58.4633910Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:26:58.4634033Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:26:58.4634124Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:58.4634229Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.4634406Z #define __ATOMIC_CONSUME 1 2025-05-07T20:26:58.4634673Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:26:58.4634762Z #define __GNUC_MINOR__ 4 2025-05-07T20:26:58.4634864Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:26:58.4634965Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:26:58.4635081Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.4635162Z #define __PIE__ 2 2025-05-07T20:26:58.4635273Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:26:58.4635373Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:26:58.4635560Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:26:58.4635784Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:58.4635876Z #define __nlink_t_defined 2025-05-07T20:26:58.4636009Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:26:58.4636121Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:26:58.4636213Z #define _XOPEN_LIM_H 1 2025-05-07T20:26:58.4636482Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:58.4636599Z #define __cpp_template_template_args 201611L 2025-05-07T20:26:58.4636705Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:26:58.4636812Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:26:58.4636911Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:26:58.4637001Z #define __FILE_defined 1 2025-05-07T20:26:58.4637180Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:26:58.4637278Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:26:58.4637373Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:26:58.4637491Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:26:58.4637606Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:26:58.4637725Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:26:58.4637828Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:26:58.4637917Z #define __INT16_C(c) c 2025-05-07T20:26:58.4638027Z #define __U32_TYPE unsigned int 2025-05-07T20:26:58.4638126Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:26:58.4638246Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:26:58.4638333Z #define __STDC__ 1 2025-05-07T20:26:58.4638428Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:26:58.4638527Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:26:58.4638628Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:26:58.4638775Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:26:58.4638869Z #define __FLT32X_DIG__ 15 2025-05-07T20:26:58.4638968Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:26:58.4639064Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:26:58.4639181Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:26:58.4639290Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:26:58.4639388Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:26:58.4639496Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:26:58.4639583Z #define stdin stdin 
2025-05-07T20:26:58.4639675Z #define __ino64_t_defined 2025-05-07T20:26:58.4639768Z #define STA_CLK 0x8000 2025-05-07T20:26:58.4639862Z #define __clockid_t_defined 1 2025-05-07T20:26:58.4640005Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:26:58.4640172Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:26:58.4640274Z #define __cudaCDP2MemsetAsync 2025-05-07T20:26:58.4640382Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:26:58.4640486Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:26:58.4640589Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:26:58.4640790Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:26:58.4640883Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:26:58.4641487Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:26:58.4641671Z #define DOMAIN 1 2025-05-07T20:26:58.4641764Z #define M_LN2 0.69314718055994530942 2025-05-07T20:26:58.4641855Z #define __NVCC__ 1 2025-05-07T20:26:58.4641958Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:26:58.4642069Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.4642177Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:26:58.4642281Z #define __throw_exception_again throw 2025-05-07T20:26:58.4642376Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:26:58.4642474Z #define __EXCEPTION_H 1 2025-05-07T20:26:58.4642571Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:26:58.4642672Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:26:58.4642974Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:58.4643084Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:26:58.4643188Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:26:58.4643283Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:26:58.4643399Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:26:58.4643500Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:26:58.4643642Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:26:58.4643748Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.4643863Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:26:58.4643956Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:26:58.4644060Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:26:58.4644163Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:26:58.4644264Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:26:58.4644397Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:26:58.4644500Z #define __useconds_t_defined 2025-05-07T20:26:58.4644599Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:26:58.4644783Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:26:58.4644927Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:26:58.4645016Z #define __SSE_MATH__ 1 2025-05-07T20:26:58.4645118Z #define _IO_wint_t wint_t 2025-05-07T20:26:58.4645212Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:26:58.4645303Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:26:58.4645406Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:26:58.4645520Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:26:58.4645616Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:26:58.4645718Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:26:58.4645805Z #define __USE_ATFILE 
1 2025-05-07T20:26:58.4645906Z #define _POSIX_OPEN_MAX 20 2025-05-07T20:26:58.4646002Z #define _POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:26:58.4646090Z #define _GCC_PTRDIFF_T 2025-05-07T20:26:58.4646318Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:58.4646415Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:26:58.4646514Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:26:58.4646622Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:26:58.4646735Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:26:58.4646822Z #define _STDLIB_H 1 2025-05-07T20:26:58.4646968Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:26:58.4647064Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.4647157Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:26:58.4647290Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.4647399Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:58.4647499Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:26:58.4647680Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:26:58.4647831Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:26:58.4647945Z #define __glibcxx_requires_nonempty() 2025-05-07T20:26:58.4648061Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:26:58.4648152Z #define __ldiv_t_defined 1 2025-05-07T20:26:58.4648336Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:26:58.4648428Z #define ___int_ptrdiff_t_h 2025-05-07T20:26:58.4648688Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:58.4648877Z #define __cudaCDP2EventDestroy 2025-05-07T20:26:58.4648971Z #define __HOST_DEFINES_H__ 2025-05-07T20:26:58.4649080Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:26:58.4649182Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:58.4649279Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:26:58.4649368Z #define CUDART_CB 2025-05-07T20:26:58.4649469Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:26:58.4649592Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:26:58.4649686Z #define MB_LEN_MAX 16 2025-05-07T20:26:58.4649904Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:58.4650001Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:26:58.4650128Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:26:58.4650240Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:26:58.4650344Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:26:58.4650500Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:26:58.4650608Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:26:58.4650699Z #define _GNU_SOURCE 1 2025-05-07T20:26:58.4650785Z #define __stub_putmsg 2025-05-07T20:26:58.4650868Z #define __CUDACC__ 1 2025-05-07T20:26:58.4650962Z #define __N(msgid) (msgid) 2025-05-07T20:26:58.4651047Z #define __P(args) args 2025-05-07T20:26:58.4651293Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:26:58.4651402Z #define __cpp_init_captures 201304L 2025-05-07T20:26:58.4651506Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:26:58.4651596Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:26:58.4651701Z #define __cpp_lib_as_const 201510 2025-05-07T20:26:58.4651782Z #define __WCHAR_T 2025-05-07T20:26:58.4651879Z #define __ATOMIC_RELEASE 3 2025-05-07T20:26:58.4651973Z #define __fsblkcnt_t_defined 2025-05-07T20:26:58.4652091Z #define 
__cudaCDP2EventCreateWithFlags 2025-05-07T20:26:58.4652209Z #define __DEVICE_DOUBLE_FUNCTIONS_H__ 2025-05-07T20:26:58.4652221Z 2025-05-07T20:26:58.4821089Z 2025-05-07T20:26:58.4821528Z + conda run -n build_binary nvcc --version 2025-05-07T20:26:58.4821567Z 2025-05-07T20:27:00.3847252Z nvcc: NVIDIA (R) Cuda compiler driver 2025-05-07T20:27:00.3847645Z Copyright (c) 2005-2024 NVIDIA Corporation 2025-05-07T20:27:00.3847963Z Built on Tue_Oct_29_23:50:19_PDT_2024 2025-05-07T20:27:00.3848266Z Cuda compilation tools, release 12.6, V12.6.85 2025-05-07T20:27:00.3848593Z Build cuda_12.6.r12.6/compiler.35059454_0 2025-05-07T20:27:00.3848796Z 2025-05-07T20:27:00.4481489Z 2025-05-07T20:27:00.4492922Z /usr/bin/nvidia-smi 2025-05-07T20:27:00.4498532Z + nvidia-smi 2025-05-07T20:27:00.4498669Z 2025-05-07T20:27:00.4672343Z Wed May 7 20:27:00 2025 2025-05-07T20:27:00.4673228Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:00.4674298Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:27:00.4675257Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:27:00.4676384Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:27:00.4677400Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:27:00.4678232Z | | | MIG M. | 2025-05-07T20:27:00.4678876Z |=========================================+========================+======================| 2025-05-07T20:27:00.4843068Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:27:00.4843597Z | 0% 29C P8 24W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:27:00.4843983Z | | | N/A | 2025-05-07T20:27:00.4848155Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:27:00.4848845Z 2025-05-07T20:27:00.4849249Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:00.4849668Z | Processes: | 2025-05-07T20:27:00.4850100Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:27:00.4850501Z | ID ID Usage | 2025-05-07T20:27:00.4850840Z |=========================================================================================| 2025-05-07T20:27:00.4852728Z | No running processes found | 2025-05-07T20:27:00.4853382Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:00.7421165Z 2025-05-07T20:27:00.7425734Z [INSTALL] Successfully installed CUDA 12.6.3 2025-05-07T20:27:00.7474903Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:27:00.7475436Z . 
$PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:27:00.7489101Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:27:00.7489434Z env: 2025-05-07T20:27:00.7489651Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:27:00.7489935Z BUILD_ENV: build_binary 2025-05-07T20:27:00.7490171Z BUILD_TARGET: genai 2025-05-07T20:27:00.7490391Z BUILD_VARIANT: cuda 2025-05-07T20:27:00.7490613Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:27:00.7490860Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:27:00.7491156Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:27:00.7491483Z ##[endgroup] 2025-05-07T20:27:01.0848791Z ################################################################################ 2025-05-07T20:27:01.0849186Z # Install PyTorch (PIP) 2025-05-07T20:27:01.0849414Z # 2025-05-07T20:27:01.0864259Z # [2025-05-07T20:27:01.086Z] + install_pytorch_pip build_binary nightly cuda/12.6.3 2025-05-07T20:27:01.0864936Z ################################################################################ 2025-05-07T20:27:01.0865293Z 2025-05-07T20:27:01.0893053Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy 2025-05-07T20:27:02.0771703Z Channels: 2025-05-07T20:27:02.0771946Z - conda-forge 2025-05-07T20:27:02.0772180Z Platform: linux-64 2025-05-07T20:27:05.3889934Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:27:06.1104906Z Solving environment: \ | / done 2025-05-07T20:27:06.3320013Z 2025-05-07T20:27:06.3320650Z ## Package Plan ## 2025-05-07T20:27:06.3320914Z 2025-05-07T20:27:06.3321192Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:27:06.3321569Z 2025-05-07T20:27:06.3321707Z added / updated specs: 2025-05-07T20:27:06.3321965Z - numpy 2025-05-07T20:27:06.3322096Z 2025-05-07T20:27:06.3322112Z 2025-05-07T20:27:06.3322241Z The following packages will be downloaded: 2025-05-07T20:27:06.3322485Z 2025-05-07T20:27:06.3322613Z package | build 2025-05-07T20:27:06.3322956Z ---------------------------|----------------- 2025-05-07T20:27:06.3323430Z libblas-3.9.0 |31_h59b9bed_openblas 16 KB conda-forge 2025-05-07T20:27:06.3324066Z libcblas-3.9.0 |31_he106b2a_openblas 16 KB conda-forge 2025-05-07T20:27:06.3324673Z libgfortran-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:27:06.3325111Z libgfortran5-15.1.0 | hcea5267_2 1.5 MB conda-forge 2025-05-07T20:27:06.3325560Z liblapack-3.9.0 |31_h7ac8fdf_openblas 16 KB conda-forge 2025-05-07T20:27:06.3326026Z libopenblas-0.3.29 |pthreads_h94d23a6_0 5.6 MB conda-forge 2025-05-07T20:27:06.3326836Z numpy-2.2.5 | py313h17eae1a_0 8.1 MB conda-forge 2025-05-07T20:27:06.3327216Z ------------------------------------------------------------ 2025-05-07T20:27:06.3327554Z Total: 15.4 MB 2025-05-07T20:27:06.3327761Z 2025-05-07T20:27:06.3327895Z The following NEW packages will be INSTALLED: 2025-05-07T20:27:06.3328108Z 2025-05-07T20:27:06.3328329Z libblas conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas 2025-05-07T20:27:06.3328826Z libcblas conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas 2025-05-07T20:27:06.3329322Z libgfortran conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2 2025-05-07T20:27:06.3329822Z libgfortran5 conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2 2025-05-07T20:27:06.3330344Z liblapack conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas 2025-05-07T20:27:06.3330870Z libopenblas conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0 2025-05-07T20:27:06.3331813Z numpy 
conda-forge/linux-64::numpy-2.2.5-py313h17eae1a_0
2025-05-07T20:27:06.3332625Z Downloading and Extracting Packages: ...working...
2025-05-07T20:27:06.4984120Z libblas-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:27:06.5955733Z libgfortran-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:27:06.6290504Z libcblas-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:27:06.6433619Z liblapack-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:27:06.6914618Z libgfortran5-15.1.0 | 1.5 MB | ########## | 100%
2025-05-07T20:27:06.7412966Z libopenblas-0.3.29 | 5.6 MB | ########## | 100%
2025-05-07T20:27:07.2352837Z numpy-2.2.5 | 8.1 MB | ########## | 100%
2025-05-07T20:27:07.2362888Z done
2025-05-07T20:27:07.3368360Z Preparing transaction: done
2025-05-07T20:27:07.5377329Z Verifying transaction: done
2025-05-07T20:27:07.6384949Z Executing transaction: done
2025-05-07T20:27:07.8175141Z ################################################################################
2025-05-07T20:27:07.8175678Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:07.8176072Z #
2025-05-07T20:27:07.8190183Z # [2025-05-07T20:27:07.818Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:27:07.8190814Z ################################################################################
2025-05-07T20:27:07.8205920Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:07.9124477Z [CHECK] Network does not appear to be blocked. 
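For reference, a minimal bash sketch of the retried network probe that precedes each PIP step above. The function name check_network, the retry count, and the back-off are assumptions inferred from the [EXEC] [ATTEMPT 0/3] lines; the actual implementation lives in .github/scripts/setup_env.bash.

# Probe pypi.org up to 3 times; any success means the network is usable.
check_network() {
  local attempt
  for attempt in 0 1 2; do
    echo "[EXEC] [ATTEMPT ${attempt}/3] + wget -q --timeout 1 pypi.org -O /dev/null"
    if wget -q --timeout 1 pypi.org -O /dev/null; then
      echo "[CHECK] Network does not appear to be blocked."
      return 0
    fi
    sleep 1  # assumed back-off between attempts
  done
  echo "[CHECK] Network appears to be blocked." >&2
  return 1
}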
2025-05-07T20:27:07.9125422Z ################################################################################ 2025-05-07T20:27:07.9126297Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:07.9126950Z # 2025-05-07T20:27:07.9141619Z # [2025-05-07T20:27:07.913Z] + __prepare_pip_arguments torch nightly cuda/12.6.3 2025-05-07T20:27:07.9142487Z ################################################################################ 2025-05-07T20:27:07.9142754Z 2025-05-07T20:27:07.9162979Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:07.9190438Z [INSTALL] Extracted package variant: cu126 2025-05-07T20:27:07.9207615Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:07.9208150Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:07.9216584Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:07.9225758Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ... 2025-05-07T20:27:07.9248439Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:28:24.5354597Z DEPRECATION: Building 'MarkupSafe' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'MarkupSafe'. Discussion can be found at https://github.com/pypa/pip/issues/6334 2025-05-07T20:28:24.5356870Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:28:24.5357253Z Collecting torch 2025-05-07T20:28:24.5357893Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:24.5358582Z Collecting filelock (from torch) 2025-05-07T20:28:24.5359068Z Using cached https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:24.5359971Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (4.13.2) 2025-05-07T20:28:24.5361020Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (78.1.1) 2025-05-07T20:28:24.5361668Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:24.5362152Z Using cached https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:24.5362647Z Collecting networkx (from torch) 2025-05-07T20:28:24.5363132Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:28:24.5365843Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 20.4 MB/s eta 0:00:00 2025-05-07T20:28:24.5366198Z Collecting jinja2 (from torch) 2025-05-07T20:28:24.5366674Z Using cached https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:24.5367168Z Collecting fsspec (from torch) 2025-05-07T20:28:24.5368106Z Using cached https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:28:24.5368548Z 2025-05-07T20:28:24.5368707Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch) 
2025-05-07T20:28:24.5369406Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-05-07T20:28:24.5370211Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 58.2 MB/s eta 0:00:00 2025-05-07T20:28:24.5370623Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch) 2025-05-07T20:28:24.5371319Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB) 2025-05-07T20:28:24.5372090Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 11.2 MB/s eta 0:00:00 2025-05-07T20:28:24.5372480Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch) 2025-05-07T20:28:24.5373164Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB) 2025-05-07T20:28:24.5374230Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 48.6 MB/s eta 0:00:00 2025-05-07T20:28:24.5374605Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch) 2025-05-07T20:28:24.5375276Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-05-07T20:28:24.5376030Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 57.4 MB/s eta 0:00:00 2025-05-07T20:28:24.5376402Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch) 2025-05-07T20:28:24.5377147Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-05-07T20:28:24.5377975Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 85.5 MB/s eta 0:00:00 2025-05-07T20:28:24.5378336Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch) 2025-05-07T20:28:24.5379020Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB) 2025-05-07T20:28:24.5379766Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 123.6 MB/s eta 0:00:00 2025-05-07T20:28:24.5380133Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch) 2025-05-07T20:28:24.5380795Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB) 2025-05-07T20:28:24.5381544Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 214.4 MB/s eta 0:00:00 2025-05-07T20:28:24.5381926Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch) 2025-05-07T20:28:24.5382603Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB) 2025-05-07T20:28:24.5383360Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 148.2 MB/s eta 0:00:00 2025-05-07T20:28:24.5383738Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch) 2025-05-07T20:28:24.5384441Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB) 2025-05-07T20:28:24.5385195Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 143.7 MB/s eta 0:00:00 2025-05-07T20:28:24.5385582Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:24.5386265Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:28:24.5387024Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 157.4 MB/s eta 0:00:00 2025-05-07T20:28:24.5387382Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:24.5388126Z 
Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:24.5388876Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch) 2025-05-07T20:28:24.5389612Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB) 2025-05-07T20:28:24.5390266Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch) 2025-05-07T20:28:24.5391022Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-05-07T20:28:24.5391855Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 195.7 MB/s eta 0:00:00 2025-05-07T20:28:24.5392224Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch) 2025-05-07T20:28:24.5392987Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:24.5393770Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:24.5394607Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:24.5395529Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:24.5396074Z Using cached https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:24.5396591Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:24.5397070Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5.tar.gz (19 kB) 2025-05-07T20:28:24.5397543Z Preparing metadata (setup.py): started 2025-05-07T20:28:24.5397913Z Preparing metadata (setup.py): finished with status 'done' 2025-05-07T20:28:24.5398966Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp313-cp313-manylinux_2_28_x86_64.whl (825.4 MB) 2025-05-07T20:28:24.5399758Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.4/825.4 MB 36.2 MB/s eta 0:00:00 2025-05-07T20:28:24.5400498Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-05-07T20:28:24.5401335Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 12.4 MB/s eta 0:00:00 2025-05-07T20:28:24.5402068Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:28:24.5402878Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 94.7 MB/s eta 0:00:00 2025-05-07T20:28:24.5403646Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:28:24.5404493Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 134.3 MB/s eta 0:00:00 2025-05-07T20:28:24.5404869Z Building wheels for collected packages: MarkupSafe 2025-05-07T20:28:24.5405238Z Building wheel for MarkupSafe (setup.py): started 2025-05-07T20:28:24.5405670Z Building wheel for MarkupSafe (setup.py): finished with status 'done' 2025-05-07T20:28:24.5406695Z Created wheel for MarkupSafe: filename=markupsafe-2.1.5-cp313-cp313-linux_x86_64.whl size=14954 sha256=8642341f746950f07f790b09c3e552393bd8cdf535cdc73dd539cf084cd476d7 2025-05-07T20:28:24.5407683Z Stored in directory: 
/home/ec2-user/.cache/pip/wheels/3a/21/87/28c44597225fd0c28d6ffa365f1c2c9dd0ab763711aa4957c6 2025-05-07T20:28:24.5408251Z Successfully built MarkupSafe 2025-05-07T20:28:24.5409867Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:28:24.5411410Z 2025-05-07T20:28:24.5413436Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:28:24.5415562Z 2025-05-07T20:28:26.7721946Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:28:26.7724057Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:28:30.2095658Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:33.6728396Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:33.6728982Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:37.0156286Z True 2025-05-07T20:28:37.0156507Z True 2025-05-07T20:28:37.0156638Z 2025-05-07T20:28:37.0781500Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:37.0818386Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:37.0818984Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:37.0832286Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:37.0832632Z env: 2025-05-07T20:28:37.0832860Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:37.0833154Z BUILD_ENV: build_binary 2025-05-07T20:28:37.0833400Z BUILD_TARGET: genai 2025-05-07T20:28:37.0833626Z BUILD_VARIANT: cuda 2025-05-07T20:28:37.0833864Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:37.0834112Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:37.0834412Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:37.0834743Z ##[endgroup] 2025-05-07T20:28:37.4190292Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:37.4192310Z ################################################################################ 2025-05-07T20:28:37.4192796Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:37.4193158Z # 2025-05-07T20:28:37.4209041Z # [2025-05-07T20:28:37.420Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:37.4209432Z ################################################################################ 2025-05-07T20:28:37.4209649Z 2025-05-07T20:28:37.4226594Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:37.5125549Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:37.5135786Z [INFO] Downloading the PyTorch environment info collection script ... 
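[NOTE] The variant and ABI checks above, and the environment collection that follows, can be reproduced by hand. A minimal sketch, assuming the same build_binary conda environment and network access:

    # Re-check the installed torch variant and the C++11 ABI flag (mirrors the [CHECK] lines above)
    conda run -n build_binary python -c "import torch; print(torch.__version__, torch.version.cuda, torch.compiled_with_cxx11_abi())"
    # Fetch and run the standard PyTorch environment collector (the same script the job downloads next)
    wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
    conda run -n build_binary python collect_env.py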
2025-05-07T20:28:37.5136391Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:37.5136785Z 2025-05-07T20:28:37.6006068Z 2025-05-07T20:28:37.6006662Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:37.6030625Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:43.3833056Z Collecting environment information... 2025-05-07T20:28:43.3833634Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:43.3834029Z Is debug build: False 2025-05-07T20:28:43.3834365Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:43.3834675Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:43.3834846Z 2025-05-07T20:28:43.3834950Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:43.3835268Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:43.3835583Z Clang version: Could not collect 2025-05-07T20:28:43.3835846Z CMake version: Could not collect 2025-05-07T20:28:43.3836113Z Libc version: glibc-2.34 2025-05-07T20:28:43.3836263Z 2025-05-07T20:28:43.3836568Z Python version: 3.13.0 | packaged by conda-forge | (main, Nov 27 2024, 19:18:50) [GCC 13.3.0] (64-bit runtime) 2025-05-07T20:28:43.3837183Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:43.3837585Z Is CUDA available: True 2025-05-07T20:28:43.3837838Z CUDA runtime version: 12.6.85 2025-05-07T20:28:43.3838221Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:43.3838622Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:43.3839001Z Nvidia driver version: 570.133.07 2025-05-07T20:28:43.3839540Z cuDNN version: Could not collect 2025-05-07T20:28:43.3839865Z HIP runtime version: N/A 2025-05-07T20:28:43.3840205Z MIOpen runtime version: N/A 2025-05-07T20:28:43.3840633Z Is XNNPACK available: True 2025-05-07T20:28:43.3840822Z 2025-05-07T20:28:43.3840932Z CPU: 2025-05-07T20:28:43.3841261Z Architecture: x86_64 2025-05-07T20:28:43.3850231Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:43.3850631Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:43.3851021Z Byte Order: Little Endian 2025-05-07T20:28:43.3851338Z CPU(s): 16 2025-05-07T20:28:43.3851639Z On-line CPU(s) list: 0-15 2025-05-07T20:28:43.3852267Z Vendor ID: AuthenticAMD 2025-05-07T20:28:43.3852617Z Model name: AMD EPYC 7R32 2025-05-07T20:28:43.3852938Z CPU family: 23 2025-05-07T20:28:43.3853218Z Model: 49 2025-05-07T20:28:43.3853508Z Thread(s) per core: 2 2025-05-07T20:28:43.3853946Z Core(s) per socket: 8 2025-05-07T20:28:43.3854226Z Socket(s): 1 2025-05-07T20:28:43.3854507Z Stepping: 0 2025-05-07T20:28:43.3854811Z BogoMIPS: 5599.85 2025-05-07T20:28:43.3856960Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:43.3858963Z Hypervisor vendor: KVM 2025-05-07T20:28:43.3859271Z Virtualization type: full 2025-05-07T20:28:43.3859680Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:43.3860102Z L1i cache: 256 KiB (8 instances) 
2025-05-07T20:28:43.3860473Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:43.3860816Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:43.3861137Z NUMA node(s): 1 2025-05-07T20:28:43.3861430Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:43.3861760Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:43.3862313Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:43.3862670Z Vulnerability L1tf: Not affected 2025-05-07T20:28:43.3863012Z Vulnerability Mds: Not affected 2025-05-07T20:28:43.3863362Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:43.3863721Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:43.3864082Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:43.3864612Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:43.3865183Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:43.3865720Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:43.3866385Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:43.3867230Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:43.3867898Z Vulnerability Srbds: Not affected 2025-05-07T20:28:43.3868262Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:43.3868492Z 2025-05-07T20:28:43.3868597Z Versions of relevant libraries: 2025-05-07T20:28:43.3868869Z [pip3] numpy==2.2.5 2025-05-07T20:28:43.3869114Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:43.3869412Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:43.3869723Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:43.3870038Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:43.3870350Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:43.3870629Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:43.3870915Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:43.3871207Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:43.3871503Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:43.3871916Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:43.3872211Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:43.3872484Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:43.3872780Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:43.3873068Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:43.3873359Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:43.3873726Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:43.3874264Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:43.3874835Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:43.3875340Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:43.3875865Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:43.3876382Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:43.3876862Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3877322Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:43.3877794Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:43.3878281Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:43.3878746Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 
2025-05-07T20:28:43.3879202Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:43.3879657Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3880109Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3880570Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:43.3881142Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:43.3881598Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:43.3882051Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:43.3882511Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3882966Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:43.3883423Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3883875Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:43.3884340Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:43.3884808Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:43.3885313Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3885834Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:43.3886315Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3886804Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:43.3887262Z [conda] numpy 2.2.5 py313h17eae1a_0 conda-forge 2025-05-07T20:28:43.3887720Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:43.3888210Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:43.3888695Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:43.3889206Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:43.3889770Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:43.3890351Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:43.3890825Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:43.3891306Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:43.3891796Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:43.3892277Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:43.3892758Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:43.3893233Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:43.3893813Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:43.3894284Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:43.3894742Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:43.3895011Z 2025-05-07T20:28:43.4569722Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:43.4570419Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:43.4582343Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:43.4582679Z env: 2025-05-07T20:28:43.4582905Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:43.4583198Z BUILD_ENV: build_binary 2025-05-07T20:28:43.4583433Z BUILD_TARGET: genai 2025-05-07T20:28:43.4583658Z BUILD_VARIANT: cuda 2025-05-07T20:28:43.4583892Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:43.4584144Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:43.4584434Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:43.4584763Z ##[endgroup] 2025-05-07T20:28:43.7971342Z ################################################################################ 2025-05-07T20:28:43.7972057Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:43.7972303Z # 2025-05-07T20:28:43.7987934Z # [2025-05-07T20:28:43.798Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:43.7988343Z ################################################################################ 2025-05-07T20:28:43.7988557Z 2025-05-07T20:28:43.8003671Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:43.8909496Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:43.8930594Z [BUILD] Running git submodules update ... 2025-05-07T20:28:43.8953461Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:43.9318211Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:43.9318685Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:43.9319116Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:43.9319498Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:43.9319895Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:43.9320390Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:43.9320794Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:43.9353736Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:43.9901440Z [BUILD] Installing other build dependencies ... 
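[NOTE] The build prep around this point is two steps: the submodule refresh that just ran, and the dependency install that follows. A standalone sketch, run from the repository root with the build_binary environment active:

    # Step 1: re-sync submodule URLs from .gitmodules, then check out the pinned commits (asmjit, cutlass, etc.)
    git submodule sync
    git submodule update --init --recursive
    # Step 2: install the build toolchain and test deps (cmake, ninja, scikit-build, hypothesis, ...)
    cd fbgemm_gpu
    python -m pip install -r requirements.txt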
2025-05-07T20:28:43.9922550Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:46.4527270Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:46.4538552Z Using cached backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:46.4887231Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:46.4896861Z Using cached build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:46.6189115Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:46.6200762Z Using cached cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:46.6530646Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:46.6539896Z Using cached click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:46.8738306Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:46.8748874Z Using cached hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:46.8834980Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:46.8838008Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:46.9280494Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:46.9290176Z Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:46.9304069Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:46.9649246Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:46.9657882Z Using cached pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:47.0108646Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:47.0291329Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:47.1105450Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:47.1114324Z Using cached scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:47.1164207Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:47.1577045Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:47.1585663Z Using cached setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:47.1970257Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:47.1979421Z Using cached tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:47.2302386Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:47.2312059Z Using cached patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:47.2719299Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:47.2728293Z Using cached packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:47.3089325Z Collecting pyproject_hooks (from build->-r 
requirements.txt (line 14)) 2025-05-07T20:28:47.3098930Z Using cached pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:47.3457851Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:47.3466813Z Using cached attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:47.3876810Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:47.3885700Z Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:47.3910907Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:47.4252764Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:47.4261654Z Using cached typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:47.4275394Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:47.4531728Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:47.4540814Z Using cached distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:47.4560678Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:47.5030219Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:47.5039084Z Using cached mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:47.5067444Z Using cached backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:47.5076448Z Using cached build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:47.5085759Z Using cached cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:47.5282295Z Using cached click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:47.5291665Z Using cached hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:47.5304668Z Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:47.5313718Z Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:47.5325346Z Using cached pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:47.5356012Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (759 kB) 2025-05-07T20:28:47.6186569Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 759.5/759.5 kB 6.9 MB/s eta 0:00:00 2025-05-07T20:28:47.6195046Z Using cached scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:47.6205020Z Using cached setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:47.6214002Z Using cached tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:47.6223556Z Using cached patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:47.6235161Z Using cached attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:47.6244518Z Using cached packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:47.6253808Z Using cached distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:47.6262920Z Using cached pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 
2025-05-07T20:28:47.6271673Z Using cached typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:47.6280569Z Using cached mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:47.7510169Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:50.1821283Z 2025-05-07T20:28:50.1874603Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:50.3574918Z ################################################################################ 2025-05-07T20:28:50.3575371Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:50.3575641Z # 2025-05-07T20:28:50.3592114Z # [2025-05-07T20:28:50.358Z] + install_triton_pip build_binary 2025-05-07T20:28:50.3592566Z ################################################################################ 2025-05-07T20:28:50.3592787Z 2025-05-07T20:28:50.3593007Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:50.3593438Z ################################################################################ 2025-05-07T20:28:50.3593791Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:50.3594096Z # 2025-05-07T20:28:50.3609123Z # [2025-05-07T20:28:50.360Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:50.3609746Z ################################################################################ 2025-05-07T20:28:50.3609959Z 2025-05-07T20:28:50.3626580Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:50.4562642Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:50.4563013Z ################################################################################ 2025-05-07T20:28:50.4563359Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:50.4563639Z # 2025-05-07T20:28:50.4580696Z # [2025-05-07T20:28:50.457Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:50.4581188Z ################################################################################ 2025-05-07T20:28:50.4581410Z 2025-05-07T20:28:50.4629656Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:50.4645994Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:50.4646746Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:50.4655154Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:50.4664433Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:50.4685182Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:57.1005683Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:57.1007787Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:57.1009407Z 2025-05-07T20:28:57.1009737Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:57.1010323Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:57.1011487Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:57.1013334Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:28:57.1015066Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 86.4 MB/s eta 0:00:00 2025-05-07T20:28:57.1015591Z Installing collected packages: pytorch-triton 2025-05-07T20:28:57.1016088Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:57.1016631Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:57.1017233Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:57.1017889Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:57.1018528Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:57.1018908Z 2025-05-07T20:28:59.3452695Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:59.3455992Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:01.5091022Z ################################################################################ 2025-05-07T20:29:01.5091479Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:01.5091850Z ################################################################################ 2025-05-07T20:29:01.5092066Z 2025-05-07T20:29:03.5751684Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:05.7684798Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:05.7688735Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:05.7742054Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:05.7742539Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:05.7754214Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:05.7754549Z env: 2025-05-07T20:29:05.7754769Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:05.7755055Z BUILD_ENV: build_binary 2025-05-07T20:29:05.7755293Z BUILD_TARGET: genai 2025-05-07T20:29:05.7755524Z BUILD_VARIANT: cuda 2025-05-07T20:29:05.7755749Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:05.7756000Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:05.7756296Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:05.7756612Z ##[endgroup] 2025-05-07T20:29:06.1133318Z ################################################################################ 2025-05-07T20:29:06.1133831Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:06.1134086Z # 2025-05-07T20:29:06.1150618Z # [2025-05-07T20:29:06.114Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1151255Z ################################################################################ 2025-05-07T20:29:06.1151466Z 2025-05-07T20:29:06.1151818Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1152493Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1152819Z 2025-05-07T20:29:06.1259376Z f90095cdf9a3f2a3bbac1aa51f6d03c22b933a7e fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1261908Z 2025-05-07T20:29:06.1262482Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1262819Z 2025-05-07T20:29:06.1387626Z ed17ebd3a2864d614d536415eaaeb2b336bf2d88ef5df95627044ab7b9ab7adc fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1389959Z 2025-05-07T20:29:06.1390371Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1391050Z 2025-05-07T20:29:06.1624594Z 7ea0c844ddb54583dafff944ccac7bb0 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1627052Z 2025-05-07T20:29:06.1636266Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:06.1657896Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:08.8678618Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:08.8679561Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:08.8680398Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:08.8680829Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:08.8681108Z 2025-05-07T20:29:15.6835489Z ################################################################################ 2025-05-07T20:29:15.6836206Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:15.6836924Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126 2025-05-07T20:29:15.6837742Z [CHECK] CUDA version reported by PyTorch is: 12.6 2025-05-07T20:29:15.6838341Z [CHECK] 2025-05-07T20:29:15.6838881Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:15.6839418Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:29:15.6839803Z ################################################################################ 2025-05-07T20:29:15.6840013Z 2025-05-07T20:29:15.6840130Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:29:19.6188219Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:23.5453450Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:27.4764995Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:27.4768013Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:39.2341695Z ################################################################################ 2025-05-07T20:29:39.2342102Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:39.2342445Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:39.2342784Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:39.2343121Z ################################################################################ 2025-05-07T20:29:39.2343331Z 2025-05-07T20:29:47.0965736Z ################################################################################ 2025-05-07T20:29:47.0966491Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:47.0968438Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:47.0969979Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:47.0970496Z ################################################################################ 2025-05-07T20:29:47.0970708Z 2025-05-07T20:29:47.0970867Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:29:51.0318244Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:29:54.9641601Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:29:59.0086292Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:02.9434601Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:02.9442090Z [INSTALL] Check for operator registrations ... 
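[NOTE] Each registration check below amounts to resolving the operator on the torch.ops namespace, which fails if the shared libraries bundled in the wheel did not load or register it. A minimal standalone probe (a sketch; importing fbgemm_gpu first triggers the library load):

    conda run -n build_binary python -c "import torch, fbgemm_gpu; print(torch.ops.fbgemm.nccl_init)"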
2025-05-07T20:30:06.7843321Z fbgemm.nccl_init 2025-05-07T20:30:06.7843516Z 2025-05-07T20:30:06.8478776Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:10.6902297Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:10.6902512Z 2025-05-07T20:30:10.7537497Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:14.5971326Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:14.5971554Z 2025-05-07T20:30:14.6601436Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:14.6602034Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:14.6638708Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:14.6639162Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:14.6651879Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:14.6652238Z env: 2025-05-07T20:30:14.6652465Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:14.6652762Z BUILD_ENV: build_binary 2025-05-07T20:30:14.6653008Z BUILD_TARGET: genai 2025-05-07T20:30:14.6653239Z BUILD_VARIANT: cuda 2025-05-07T20:30:14.6653469Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:30:14.6653923Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:14.6654228Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:14.6654554Z ##[endgroup] 2025-05-07T20:30:15.0044183Z ################################################################################ 2025-05-07T20:30:15.0044550Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:15.0044813Z # 2025-05-07T20:30:15.0062152Z # [2025-05-07T20:30:15.005Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:15.0062563Z ################################################################################ 2025-05-07T20:30:15.0062780Z 2025-05-07T20:30:22.8459731Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:22.8460351Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:22.8460751Z [TEST] Determined the test directories: 2025-05-07T20:30:22.8461062Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:22.8461352Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:22.8461652Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:22.8461835Z 2025-05-07T20:30:22.8466532Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:22.8473301Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:22.8473872Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:22.8474261Z 2025-05-07T20:30:23.2753260Z 2025-05-07T20:30:23.2754053Z [TEST] Installing PyTest ... 
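[NOTE] pytest and expecttest are installed from conda-forge below (hypothesis, the third test dependency, already arrived via requirements.txt). A quick post-install sanity check (a sketch) is:

    conda run -n build_binary python -m pytest --version
    conda run -n build_binary python -c "import expecttest, hypothesis; print(expecttest.__name__, hypothesis.__version__)"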
2025-05-07T20:30:23.2776369Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:24.5572341Z Channels: 2025-05-07T20:30:24.5572599Z - conda-forge 2025-05-07T20:30:24.5572824Z Platform: linux-64 2025-05-07T20:30:27.8748624Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:29.0223416Z Solving environment: \ | / done 2025-05-07T20:30:29.2508881Z 2025-05-07T20:30:29.2509397Z ## Package Plan ## 2025-05-07T20:30:29.2509577Z 2025-05-07T20:30:29.2509790Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:29.2510085Z 2025-05-07T20:30:29.2510183Z added / updated specs: 2025-05-07T20:30:29.2510438Z - expecttest 2025-05-07T20:30:29.2510658Z - pytest 2025-05-07T20:30:29.2510777Z 2025-05-07T20:30:29.2510781Z 2025-05-07T20:30:29.2510904Z The following packages will be downloaded: 2025-05-07T20:30:29.2511134Z 2025-05-07T20:30:29.2511251Z package | build 2025-05-07T20:30:29.2511568Z ---------------------------|----------------- 2025-05-07T20:30:29.2511938Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:29.2513056Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:29.2513658Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:29.2514220Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:29.2514656Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:29.2515058Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:29.2515457Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:29.2516072Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:29.2516454Z ------------------------------------------------------------ 2025-05-07T20:30:29.2516782Z Total: 428 KB 2025-05-07T20:30:29.2516994Z 2025-05-07T20:30:29.2517124Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:29.2517335Z 2025-05-07T20:30:29.2517538Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:29.2518233Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:29.2518859Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:29.2519326Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:29.2519784Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:29.2520228Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:29.2520653Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:29.2521066Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:29.2521313Z 2025-05-07T20:30:29.2521317Z 2025-05-07T20:30:29.2521321Z 2025-05-07T20:30:29.2521465Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:30:29.3124038Z pluggy-1.5.0 | 23 KB | ########## | 100% 2025-05-07T20:30:29.3179213Z pytest-8.3.5 | 254 KB | ########## | 100% 2025-05-07T20:30:29.3486317Z colorama-0.4.6 | 26 KB | ########## | 100% 2025-05-07T20:30:29.3512558Z packaging-25.0 | 61 KB | ########## | 100% 2025-05-07T20:30:29.3929580Z tomli-2.2.1 | 19 KB | ########## | 100% 2025-05-07T20:30:29.4124482Z expecttest-0.3.0 | 14 KB | ########## | 100% 2025-05-07T20:30:29.4221064Z exceptiongroup-1.2.2 | 20 KB | ########## | 100% 2025-05-07T20:30:29.4227935Z iniconfig-2.0.0 | 11 KB | ########## | 100% 2025-05-07T20:30:29.4232289Z done 2025-05-07T20:30:29.5235050Z Preparing transaction: \ done 2025-05-07T20:30:29.6242137Z Verifying transaction: / done 2025-05-07T20:30:31.5272010Z Executing transaction: \ | / - \ | / - \ | / - \ | / - \ | / done 2025-05-07T20:30:31.6579573Z [TEST] Checking imports ... 2025-05-07T20:30:35.5716294Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:30:35.5729883Z [TEST] Setting feature flags ...
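[NOTE] FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD is read from the environment as a feature gate; persisting it on the conda env (next command) means every later `conda run -n build_binary` sees it without per-shell exports. A one-off alternative (a sketch) is to set it inline for a single test run:

    FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 conda run --no-capture-output -n build_binary \
        python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning ./attention/gqa_test.py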
2025-05-07T20:30:35.5730391Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:35.5730798Z 2025-05-07T20:30:35.9989262Z 2025-05-07T20:30:35.9989630Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:35.9991245Z ################################################################################ 2025-05-07T20:30:35.9991693Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:35.9991933Z # 2025-05-07T20:30:36.0011324Z # [2025-05-07T20:30:36.000Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:36.0011893Z ################################################################################ 2025-05-07T20:30:36.0012178Z 2025-05-07T20:30:36.0019140Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:36.0049789Z ./attention/gqa_test.py 2025-05-07T20:30:36.0050197Z ./coalesce/coalesce_test.py 2025-05-07T20:30:36.0050544Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:36.0050911Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:36.0051260Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:36.0051526Z ./moe/activation_test.py 2025-05-07T20:30:36.0051813Z ./moe/gather_scatter_test.py 2025-05-07T20:30:36.0052063Z ./moe/layers_test.py 2025-05-07T20:30:36.0052289Z ./moe/shuffling_test.py 2025-05-07T20:30:36.0052528Z ./quantize/quantize_test.py 2025-05-07T20:30:36.0052695Z 2025-05-07T20:30:36.0052810Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:36.0062732Z 2025-05-07T20:30:36.0075809Z ################################################################################ 2025-05-07T20:30:36.0093425Z # [2025-05-07T20:30:36.009Z] Run Python Test Suite: 2025-05-07T20:30:36.0093992Z # ./attention/gqa_test.py 2025-05-07T20:30:36.0094360Z ################################################################################ 2025-05-07T20:30:36.0118688Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:36.0119411Z 2025-05-07T20:30:38.5749763Z ============================= test session starts ============================== 2025-05-07T20:30:38.5750802Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:38.5751317Z cachedir: .pytest_cache 2025-05-07T20:30:38.5751898Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:38.5752614Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:38.5753025Z plugins: hypothesis-6.131.14 2025-05-07T20:30:40.1512733Z collecting ... 
collected 2 items 2025-05-07T20:30:40.1513248Z 2025-05-07T20:31:17.4167266Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa( 2025-05-07T20:31:17.4168375Z self=, 2025-05-07T20:31:17.4168779Z int4_kv=False, 2025-05-07T20:31:17.4169036Z num_groups=1, 2025-05-07T20:31:17.4169289Z B=1, 2025-05-07T20:31:17.4169519Z MAX_T=4, 2025-05-07T20:31:17.4169763Z N_H_L=1, 2025-05-07T20:31:17.4170001Z ) 2025-05-07T20:31:17.4170248Z Trying example: test_gqa( 2025-05-07T20:31:17.4170597Z self=, 2025-05-07T20:31:17.4170972Z int4_kv=True, 2025-05-07T20:31:17.4171225Z num_groups=1, 2025-05-07T20:31:17.4171506Z B=1, 2025-05-07T20:31:17.4171724Z MAX_T=4, 2025-05-07T20:31:17.4171963Z N_H_L=1, 2025-05-07T20:31:17.4172209Z ) 2025-05-07T20:31:17.4172458Z Trying example: test_gqa( 2025-05-07T20:31:17.4172823Z self=, 2025-05-07T20:31:17.4173224Z int4_kv=True, 2025-05-07T20:31:17.4173481Z num_groups=4, 2025-05-07T20:31:17.4173899Z B=23, 2025-05-07T20:31:17.4174118Z MAX_T=33, 2025-05-07T20:31:17.4174359Z N_H_L=68, 2025-05-07T20:31:17.4174589Z ) 2025-05-07T20:31:17.4174812Z Trying example: test_gqa( 2025-05-07T20:31:17.4175154Z self=, 2025-05-07T20:31:17.4175527Z int4_kv=True, 2025-05-07T20:31:17.4175774Z num_groups=4, 2025-05-07T20:31:17.4176029Z B=77, 2025-05-07T20:31:17.4176251Z MAX_T=4, 2025-05-07T20:31:17.4176475Z N_H_L=1, 2025-05-07T20:31:17.4176700Z ) 2025-05-07T20:31:17.4176932Z Trying example: test_gqa( 2025-05-07T20:31:17.4177270Z self=, 2025-05-07T20:31:17.4177643Z int4_kv=True, 2025-05-07T20:31:17.4177902Z num_groups=4, 2025-05-07T20:31:17.4178140Z B=77, 2025-05-07T20:31:17.4178370Z MAX_T=52, 2025-05-07T20:31:17.4178601Z N_H_L=67, 2025-05-07T20:31:17.4178822Z ) 2025-05-07T20:31:17.4179053Z Trying example: test_gqa( 2025-05-07T20:31:17.4179396Z self=, 2025-05-07T20:31:17.4179794Z int4_kv=False, 2025-05-07T20:31:17.4180081Z num_groups=4, 2025-05-07T20:31:17.4180482Z B=57, 2025-05-07T20:31:17.4180710Z MAX_T=45, 2025-05-07T20:31:17.4180940Z N_H_L=120, 2025-05-07T20:31:17.4181173Z ) 2025-05-07T20:31:17.4181403Z Trying example: test_gqa( 2025-05-07T20:31:17.4181743Z self=, 2025-05-07T20:31:17.4182115Z int4_kv=True, 2025-05-07T20:31:17.4182364Z num_groups=4, 2025-05-07T20:31:17.4182600Z B=52, 2025-05-07T20:31:17.4182823Z MAX_T=42, 2025-05-07T20:31:17.4183054Z N_H_L=53, 2025-05-07T20:31:17.4183275Z ) 2025-05-07T20:31:17.4183502Z Trying example: test_gqa( 2025-05-07T20:31:17.4183851Z self=, 2025-05-07T20:31:17.4184215Z int4_kv=True, 2025-05-07T20:31:17.4184464Z num_groups=1, 2025-05-07T20:31:17.4184708Z B=77, 2025-05-07T20:31:17.4184925Z MAX_T=95, 2025-05-07T20:31:17.4185156Z N_H_L=53, 2025-05-07T20:31:17.4185387Z ) 2025-05-07T20:31:17.4185610Z Trying example: test_gqa( 2025-05-07T20:31:17.4185958Z self=, 2025-05-07T20:31:17.4186330Z int4_kv=True, 2025-05-07T20:31:17.4186570Z num_groups=4, 2025-05-07T20:31:17.4186813Z B=113, 2025-05-07T20:31:17.4187040Z MAX_T=48, 2025-05-07T20:31:17.4187498Z N_H_L=96, 2025-05-07T20:31:17.4187726Z ) 2025-05-07T20:31:17.4187954Z Trying example: test_gqa( 2025-05-07T20:31:17.4188296Z self=, 2025-05-07T20:31:17.4188660Z int4_kv=False, 2025-05-07T20:31:17.4188911Z num_groups=1, 2025-05-07T20:31:17.4189160Z B=51, 2025-05-07T20:31:17.4189383Z MAX_T=61, 2025-05-07T20:31:17.4189617Z N_H_L=69, 2025-05-07T20:31:17.4189847Z ) 2025-05-07T20:31:17.4190069Z Trying example: test_gqa( 2025-05-07T20:31:17.4190414Z self=, 2025-05-07T20:31:17.4190788Z int4_kv=False, 2025-05-07T20:31:17.4191032Z num_groups=4, 2025-05-07T20:31:17.4191279Z B=17, 2025-05-07T20:31:17.4191508Z MAX_T=113, 
2025-05-07T20:31:17.4191839Z N_H_L=65, 2025-05-07T20:31:17.4192072Z ) 2025-05-07T20:31:17.4192301Z Trying example: test_gqa( 2025-05-07T20:31:17.4192637Z self=, 2025-05-07T20:31:17.4193013Z int4_kv=False, 2025-05-07T20:31:17.4193274Z num_groups=4, 2025-05-07T20:31:17.4193513Z B=17, 2025-05-07T20:31:17.4193738Z MAX_T=65, 2025-05-07T20:31:17.4193970Z N_H_L=65, 2025-05-07T20:31:17.4194191Z ) 2025-05-07T20:31:17.4194422Z Trying example: test_gqa( 2025-05-07T20:31:17.4194764Z self=, 2025-05-07T20:31:17.4195129Z int4_kv=False, 2025-05-07T20:31:17.4195383Z num_groups=4, 2025-05-07T20:31:17.4195628Z B=65, 2025-05-07T20:31:17.4195850Z MAX_T=65, 2025-05-07T20:31:17.4196077Z N_H_L=65, 2025-05-07T20:31:17.4196307Z ) 2025-05-07T20:31:17.4196535Z Trying example: test_gqa( 2025-05-07T20:31:17.4196869Z self=, 2025-05-07T20:31:17.4197243Z int4_kv=False, 2025-05-07T20:31:17.4197501Z num_groups=1, 2025-05-07T20:31:17.4197741Z B=6, 2025-05-07T20:31:17.4197972Z MAX_T=108, 2025-05-07T20:31:17.4198453Z N_H_L=14, 2025-05-07T20:31:17.4198683Z ) 2025-05-07T20:31:17.4198915Z Trying example: test_gqa( 2025-05-07T20:31:17.4199263Z self=, 2025-05-07T20:31:17.4199633Z int4_kv=False, 2025-05-07T20:31:17.4199884Z num_groups=1, 2025-05-07T20:31:17.4200126Z B=6, 2025-05-07T20:31:17.4200340Z MAX_T=14, 2025-05-07T20:31:17.4200572Z N_H_L=14, 2025-05-07T20:31:17.4200800Z ) 2025-05-07T20:31:17.4201022Z Trying example: test_gqa( 2025-05-07T20:31:17.4201365Z self=, 2025-05-07T20:31:17.4201737Z int4_kv=False, 2025-05-07T20:31:17.4201981Z num_groups=1, 2025-05-07T20:31:17.4202227Z B=6, 2025-05-07T20:31:17.4202451Z MAX_T=6, 2025-05-07T20:31:17.4202677Z N_H_L=14, 2025-05-07T20:31:17.4202931Z ) 2025-05-07T20:31:17.4203189Z Trying example: test_gqa( 2025-05-07T20:31:17.4203526Z self=, 2025-05-07T20:31:17.4203900Z int4_kv=False, 2025-05-07T20:31:17.4204151Z num_groups=1, 2025-05-07T20:31:17.4204388Z B=6, 2025-05-07T20:31:17.4204616Z MAX_T=6, 2025-05-07T20:31:17.4204844Z N_H_L=6, 2025-05-07T20:31:17.4205071Z ) 2025-05-07T20:31:17.4205295Z Trying example: test_gqa( 2025-05-07T20:31:17.4205637Z self=, 2025-05-07T20:31:17.4206013Z int4_kv=False, 2025-05-07T20:31:17.4206256Z num_groups=1, 2025-05-07T20:31:17.4206502Z B=70, 2025-05-07T20:31:17.4206729Z MAX_T=94, 2025-05-07T20:31:17.4206958Z N_H_L=78, 2025-05-07T20:31:17.4207184Z ) 2025-05-07T20:31:17.4207414Z Trying example: test_gqa( 2025-05-07T20:31:17.4207748Z self=, 2025-05-07T20:31:17.4208119Z int4_kv=False, 2025-05-07T20:31:17.4208372Z num_groups=1, 2025-05-07T20:31:17.4208608Z B=78, 2025-05-07T20:31:17.4208835Z MAX_T=94, 2025-05-07T20:31:17.4209068Z N_H_L=78, 2025-05-07T20:31:17.4209287Z ) 2025-05-07T20:31:17.4209517Z Trying example: test_gqa( 2025-05-07T20:31:17.4209860Z self=, 2025-05-07T20:31:17.4210378Z int4_kv=False, 2025-05-07T20:31:17.4210630Z num_groups=1, 2025-05-07T20:31:17.4210875Z B=94, 2025-05-07T20:31:17.4211094Z MAX_T=94, 2025-05-07T20:31:17.4211325Z N_H_L=78, 2025-05-07T20:31:17.4211557Z ) 2025-05-07T20:31:17.4211783Z Trying example: test_gqa( 2025-05-07T20:31:17.4212123Z self=, 2025-05-07T20:31:17.4212499Z int4_kv=False, 2025-05-07T20:31:17.4212747Z num_groups=1, 2025-05-07T20:31:17.4213033Z B=94, 2025-05-07T20:31:17.4213267Z MAX_T=94, 2025-05-07T20:31:17.4213493Z N_H_L=94, 2025-05-07T20:31:17.4213817Z ) 2025-05-07T20:31:17.4214047Z Trying example: test_gqa( 2025-05-07T20:31:17.4214389Z self=, 2025-05-07T20:31:17.4214904Z int4_kv=False, 2025-05-07T20:31:17.4215164Z num_groups=4, 2025-05-07T20:31:17.4215410Z B=41, 2025-05-07T20:31:17.4215632Z MAX_T=105, 
2025-05-07T20:31:17.4215871Z N_H_L=126, 2025-05-07T20:31:17.4216112Z ) 2025-05-07T20:31:17.4216345Z Trying example: test_gqa( 2025-05-07T20:31:17.4216692Z self=, 2025-05-07T20:31:17.4217088Z int4_kv=False, 2025-05-07T20:31:17.4217364Z num_groups=4, 2025-05-07T20:31:17.4217645Z B=105, 2025-05-07T20:31:17.4217901Z MAX_T=105, 2025-05-07T20:31:17.4218162Z N_H_L=126, 2025-05-07T20:31:17.4218414Z ) 2025-05-07T20:31:17.4218662Z Trying example: test_gqa( 2025-05-07T20:31:17.4219024Z self=, 2025-05-07T20:31:17.4219436Z int4_kv=False, 2025-05-07T20:31:17.4219649Z num_groups=4, 2025-05-07T20:31:17.4219856Z B=105, 2025-05-07T20:31:17.4220050Z MAX_T=105, 2025-05-07T20:31:17.4220254Z N_H_L=105, 2025-05-07T20:31:17.4220456Z ) 2025-05-07T20:31:17.4220655Z Trying example: test_gqa( 2025-05-07T20:31:17.4220949Z self=, 2025-05-07T20:31:17.4221257Z int4_kv=True, 2025-05-07T20:31:17.4221465Z num_groups=1, 2025-05-07T20:31:17.4221678Z B=95, 2025-05-07T20:31:17.4221876Z MAX_T=114, 2025-05-07T20:31:17.4222071Z N_H_L=43, 2025-05-07T20:31:17.4222272Z ) 2025-05-07T20:31:17.4222471Z Trying example: test_gqa( 2025-05-07T20:31:17.4222755Z self=, 2025-05-07T20:31:17.4223063Z int4_kv=True, 2025-05-07T20:31:17.4223277Z num_groups=1, 2025-05-07T20:31:17.4223479Z B=43, 2025-05-07T20:31:17.4223675Z MAX_T=114, 2025-05-07T20:31:17.4223878Z N_H_L=43, 2025-05-07T20:31:17.4224068Z ) 2025-05-07T20:31:17.4224263Z Trying example: test_gqa( 2025-05-07T20:31:17.4224553Z self=, 2025-05-07T20:31:17.4224851Z int4_kv=True, 2025-05-07T20:31:17.4225062Z num_groups=1, 2025-05-07T20:31:17.4225274Z B=43, 2025-05-07T20:31:17.4225464Z MAX_T=43, 2025-05-07T20:31:17.4225658Z N_H_L=43, 2025-05-07T20:31:17.4225853Z ) 2025-05-07T20:31:17.4226049Z Trying example: test_gqa( 2025-05-07T20:31:17.4226336Z self=, 2025-05-07T20:31:17.4226652Z int4_kv=False, 2025-05-07T20:31:17.4226866Z num_groups=1, 2025-05-07T20:31:17.4227067Z B=21, 2025-05-07T20:31:17.4227263Z MAX_T=38, 2025-05-07T20:31:17.4227461Z N_H_L=42, 2025-05-07T20:31:17.4227649Z ) 2025-05-07T20:31:17.4227844Z Trying example: test_gqa( 2025-05-07T20:31:17.4228140Z self=, 2025-05-07T20:31:17.4228444Z int4_kv=False, 2025-05-07T20:31:17.4228659Z num_groups=1, 2025-05-07T20:31:17.4228863Z B=38, 2025-05-07T20:31:17.4229045Z MAX_T=38, 2025-05-07T20:31:17.4229244Z N_H_L=42, 2025-05-07T20:31:17.4229440Z ) 2025-05-07T20:31:17.4229630Z Trying example: test_gqa( 2025-05-07T20:31:17.4229923Z self=, 2025-05-07T20:31:17.4230232Z int4_kv=False, 2025-05-07T20:31:17.4230440Z num_groups=1, 2025-05-07T20:31:17.4230651Z B=38, 2025-05-07T20:31:17.4230847Z MAX_T=42, 2025-05-07T20:31:17.4231041Z N_H_L=42, 2025-05-07T20:31:17.4231856Z ) 2025-05-07T20:31:17.4232058Z Trying example: test_gqa( 2025-05-07T20:31:17.4232340Z self=, 2025-05-07T20:31:17.4232654Z int4_kv=False, 2025-05-07T20:31:17.4232868Z num_groups=1, 2025-05-07T20:31:17.4233078Z B=42, 2025-05-07T20:31:17.4233265Z MAX_T=42, 2025-05-07T20:31:17.4233468Z N_H_L=42, 2025-05-07T20:31:17.4233664Z ) 2025-05-07T20:31:17.4233857Z Trying example: test_gqa( 2025-05-07T20:31:17.4234147Z self=, 2025-05-07T20:31:17.4234461Z int4_kv=True, 2025-05-07T20:31:17.4234669Z num_groups=1, 2025-05-07T20:31:17.4234877Z B=74, 2025-05-07T20:31:17.4235070Z MAX_T=20, 2025-05-07T20:31:17.4235262Z N_H_L=15, 2025-05-07T20:31:17.4235550Z ) 2025-05-07T20:31:17.4235744Z Trying example: test_gqa( 2025-05-07T20:31:17.4236027Z self=, 2025-05-07T20:31:17.4236334Z int4_kv=True, 2025-05-07T20:31:17.4236546Z num_groups=1, 2025-05-07T20:31:17.4236751Z B=20, 2025-05-07T20:31:17.4236939Z MAX_T=20, 
2025-05-07T20:31:17.4237133Z N_H_L=15, 2025-05-07T20:31:17.4237320Z ) 2025-05-07T20:31:17.4237516Z Trying example: test_gqa( 2025-05-07T20:31:17.4237805Z self=, 2025-05-07T20:31:17.4238109Z int4_kv=True, 2025-05-07T20:31:17.4238317Z num_groups=1, 2025-05-07T20:31:17.4238524Z B=20, 2025-05-07T20:31:17.4238708Z MAX_T=15, 2025-05-07T20:31:17.4238907Z N_H_L=15, 2025-05-07T20:31:17.4239100Z ) 2025-05-07T20:31:17.4239285Z Trying example: test_gqa( 2025-05-07T20:31:17.4239568Z self=, 2025-05-07T20:31:17.4239874Z int4_kv=True, 2025-05-07T20:31:17.4240080Z num_groups=1, 2025-05-07T20:31:17.4240290Z B=15, 2025-05-07T20:31:17.4240482Z MAX_T=20, 2025-05-07T20:31:17.4240681Z N_H_L=15, 2025-05-07T20:31:17.4240866Z ) 2025-05-07T20:31:17.4241061Z Trying example: test_gqa( 2025-05-07T20:31:17.4241353Z self=, 2025-05-07T20:31:17.4241659Z int4_kv=True, 2025-05-07T20:31:17.4241868Z num_groups=1, 2025-05-07T20:31:17.4242073Z B=15, 2025-05-07T20:31:17.4242259Z MAX_T=15, 2025-05-07T20:31:17.4242462Z N_H_L=15, 2025-05-07T20:31:17.4242654Z ) 2025-05-07T20:31:17.4242846Z Trying example: test_gqa( 2025-05-07T20:31:17.4243132Z self=, 2025-05-07T20:31:17.4243444Z int4_kv=False, 2025-05-07T20:31:17.4243651Z num_groups=4, 2025-05-07T20:31:17.4243856Z B=117, 2025-05-07T20:31:17.4244052Z MAX_T=104, 2025-05-07T20:31:17.4244246Z N_H_L=69, 2025-05-07T20:31:17.4244441Z ) 2025-05-07T20:31:17.4244640Z Trying example: test_gqa( 2025-05-07T20:31:17.4244926Z self=, 2025-05-07T20:31:17.4245236Z int4_kv=False, 2025-05-07T20:31:17.4245449Z num_groups=4, 2025-05-07T20:31:17.4245646Z B=117, 2025-05-07T20:31:17.4245839Z MAX_T=117, 2025-05-07T20:31:17.4246046Z N_H_L=69, 2025-05-07T20:31:17.4246229Z ) 2025-05-07T20:31:17.4246420Z Trying example: test_gqa( 2025-05-07T20:31:17.4246709Z self=, 2025-05-07T20:31:17.4247012Z int4_kv=False, 2025-05-07T20:31:17.4247227Z num_groups=4, 2025-05-07T20:31:17.4247435Z B=69, 2025-05-07T20:31:17.4247626Z MAX_T=117, 2025-05-07T20:31:17.4247816Z N_H_L=69, 2025-05-07T20:31:17.4248008Z ) 2025-05-07T20:31:17.4248200Z Trying example: test_gqa( 2025-05-07T20:31:17.4248478Z self=, 2025-05-07T20:31:17.4248785Z int4_kv=False, 2025-05-07T20:31:17.4248997Z num_groups=4, 2025-05-07T20:31:17.4249197Z B=117, 2025-05-07T20:31:17.4249390Z MAX_T=69, 2025-05-07T20:31:17.4249589Z N_H_L=69, 2025-05-07T20:31:17.4249772Z ) 2025-05-07T20:31:17.4249959Z PASSED 2025-05-07T20:31:17.4355883Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...) 
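The "Trying example: test_gqa(...)" lines above are Hypothesis printing each generated input at Verbosity.verbose, and every pytest session header in this log reports the profile in force: hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,). As a minimal sketch of how such a profile is typically registered (assuming a conftest.py-style setup; FBGEMM's actual configuration may differ):

    # Minimal sketch of the 'ci' Hypothesis profile shown in the session headers.
    # derandomize=True makes example generation deterministic across runs; the
    # per-test @settings(verbosity=Verbosity.verbose) seen later in this log is
    # what emits the "Trying example: ..." lines.
    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,                                  # no example database on CI
        deadline=None,                                  # no per-example time limit
        print_blob=True,                                # print reproduction blobs on failure
        derandomize=True,                               # deterministic generation
        suppress_health_check=(HealthCheck.too_slow,),  # tolerate slow GPU examples
    )
    settings.load_profile("ci")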
2025-05-07T20:31:17.4356210Z 2025-05-07T20:31:17.4356566Z =========================== short test summary info ============================ 2025-05-07T20:31:17.4357271Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:17.4357957Z ======================== 1 passed, 1 skipped in 39.38s ========================= 2025-05-07T20:31:18.0894981Z 2025-05-07T20:31:18.0895491Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:18.0915061Z [TEST] Python test time for ./attention/gqa_test.py: 42 seconds 2025-05-07T20:31:18.0915340Z 2025-05-07T20:31:18.0915351Z 2025-05-07T20:31:18.0915562Z 2025-05-07T20:31:18.0915574Z 2025-05-07T20:31:18.0936746Z ################################################################################ 2025-05-07T20:31:18.0951497Z # [2025-05-07T20:31:18.094Z] Run Python Test Suite: 2025-05-07T20:31:18.0951838Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:18.0952125Z ################################################################################ 2025-05-07T20:31:18.0977842Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:18.0978443Z 2025-05-07T20:31:20.2666165Z ============================= test session starts ============================== 2025-05-07T20:31:20.2666796Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:20.2667308Z cachedir: .pytest_cache 2025-05-07T20:31:20.2667875Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:20.2668616Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:20.2669022Z plugins: hypothesis-6.131.14 2025-05-07T20:31:21.8292613Z collecting ... 
collected 1 item 2025-05-07T20:31:21.8293145Z 2025-05-07T20:31:22.5853866Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:22.5854195Z 2025-05-07T20:31:22.5854762Z ============================== 1 passed in 2.45s =============================== 2025-05-07T20:31:23.2147350Z 2025-05-07T20:31:23.2147838Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:23.2168569Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:31:23.2168853Z 2025-05-07T20:31:23.2168858Z 2025-05-07T20:31:23.2168861Z 2025-05-07T20:31:23.2168865Z 2025-05-07T20:31:23.2189812Z ################################################################################ 2025-05-07T20:31:23.2205707Z # [2025-05-07T20:31:23.220Z] Run Python Test Suite: 2025-05-07T20:31:23.2206072Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:23.2206367Z ################################################################################ 2025-05-07T20:31:23.2232569Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:23.2233197Z 2025-05-07T20:31:25.3950925Z ============================= test session starts ============================== 2025-05-07T20:31:25.3951571Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:25.3952089Z cachedir: .pytest_cache 2025-05-07T20:31:25.3952654Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:25.3953367Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:25.3953780Z plugins: hypothesis-6.131.14 2025-05-07T20:31:26.9954491Z collecting ... 
collected 5 items 2025-05-07T20:31:26.9954701Z 2025-05-07T20:31:26.9965341Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:26.9972587Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:26.9979353Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:26.9989883Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:27.0005132Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:27.0005478Z 2025-05-07T20:31:27.0005625Z =========================== short test summary info ============================ 2025-05-07T20:31:27.0006292Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:27.0007397Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:27.0008304Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:27.0009198Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:27.0010100Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:27.0010736Z ============================== 5 skipped in 1.74s ============================== 2025-05-07T20:31:27.5728782Z 2025-05-07T20:31:27.5729467Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:27.5749035Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds 2025-05-07T20:31:27.5749343Z 2025-05-07T20:31:27.5749348Z 2025-05-07T20:31:27.5749352Z 2025-05-07T20:31:27.5749387Z 2025-05-07T20:31:27.5770237Z ################################################################################ 2025-05-07T20:31:27.5785748Z # [2025-05-07T20:31:27.578Z] Run Python Test Suite: 2025-05-07T20:31:27.5786091Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:27.5786421Z ################################################################################ 2025-05-07T20:31:27.5811802Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:27.5812453Z 2025-05-07T20:31:29.7524188Z ============================= test session starts ============================== 2025-05-07T20:31:29.7524984Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:29.7525503Z cachedir: .pytest_cache 2025-05-07T20:31:29.7526094Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:29.7526857Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:29.7527261Z plugins: hypothesis-6.131.14 2025-05-07T20:31:31.4060986Z collecting ... 
collected 2 items 2025-05-07T20:31:31.4061276Z 2025-05-07T20:31:31.4072093Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:31.4086844Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:31.4087464Z 2025-05-07T20:31:31.4087674Z =========================== short test summary info ============================ 2025-05-07T20:31:31.4088291Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:31.4089103Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:31.4089709Z ============================== 2 skipped in 1.79s ============================== 2025-05-07T20:31:31.9992510Z 2025-05-07T20:31:31.9993080Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:32.0014360Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:31:32.0014681Z 2025-05-07T20:31:32.0014686Z 2025-05-07T20:31:32.0014690Z 2025-05-07T20:31:32.0014694Z 2025-05-07T20:31:32.0035304Z ################################################################################ 2025-05-07T20:31:32.0050442Z # [2025-05-07T20:31:32.004Z] Run Python Test Suite: 2025-05-07T20:31:32.0050769Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:32.0051057Z ################################################################################ 2025-05-07T20:31:32.0076430Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:32.0077245Z 2025-05-07T20:31:34.1808374Z ============================= test session starts ============================== 2025-05-07T20:31:34.1809039Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:34.1809586Z cachedir: .pytest_cache 2025-05-07T20:31:34.1810152Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:34.1810859Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:34.1811258Z plugins: hypothesis-6.131.14 2025-05-07T20:31:35.7577869Z collecting ... collected 4 items 2025-05-07T20:31:35.7578076Z 2025-05-07T20:31:38.0279881Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
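The gather_scatter suite above is skipped with "Skip when no Hopper GPU is available. This test is only for Hopper GPU.", and the kv_cache fp8 test above is likewise gated on H100 or MI300 hardware, as its skip reason in the summary below shows. A minimal sketch of that kind of hardware guard (an assumed helper for illustration, not FBGEMM's actual code):

    # Assumed sketch of a Hopper-only skip guard; Hopper-class GPUs (H100)
    # report CUDA compute capability (9, 0).
    import unittest

    import torch

    def has_hopper_gpu() -> bool:
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (9, 0)
        )

    @unittest.skipIf(
        not has_hopper_gpu(),
        "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
    )
    class GatherScatterTests(unittest.TestCase):
        ...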
2025-05-07T20:31:38.0362960Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:38.0452601Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:38.0538838Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:38.0539202Z 2025-05-07T20:31:38.0539352Z =========================== short test summary info ============================ 2025-05-07T20:31:38.0540053Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:38.0540937Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when xformers is not available 2025-05-07T20:31:38.0541535Z ============================== 4 skipped in 4.01s ============================== 2025-05-07T20:31:40.2816660Z 2025-05-07T20:31:40.2817359Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:40.2837422Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:40.2837707Z 2025-05-07T20:31:40.2837739Z 2025-05-07T20:31:40.2837743Z 2025-05-07T20:31:40.2837747Z 2025-05-07T20:31:40.2858375Z ################################################################################ 2025-05-07T20:31:40.2873760Z # [2025-05-07T20:31:40.287Z] Run Python Test Suite: 2025-05-07T20:31:40.2874107Z # ./moe/activation_test.py 2025-05-07T20:31:40.2874390Z ################################################################################ 2025-05-07T20:31:40.2899805Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:40.2900382Z 2025-05-07T20:31:42.4639644Z ============================= test session starts ============================== 2025-05-07T20:31:42.4640579Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:42.4641094Z cachedir: .pytest_cache 2025-05-07T20:31:42.4641688Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:42.4642400Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:42.4642811Z plugins: hypothesis-6.131.14 2025-05-07T20:31:44.0529274Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:44.1497056Z collecting ... 
collected 2 items 2025-05-07T20:31:44.1497318Z 2025-05-07T20:31:49.0934227Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:31:49.0935102Z self=, 2025-05-07T20:31:49.0935611Z T=1, 2025-05-07T20:31:49.0935843Z D=5120, 2025-05-07T20:31:49.0936094Z contiguous=True, 2025-05-07T20:31:49.0936379Z compiled=True, 2025-05-07T20:31:49.0936641Z ) 2025-05-07T20:31:49.0936897Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0937710Z self=, 2025-05-07T20:31:49.0938087Z T=4096, 2025-05-07T20:31:49.0938275Z D=5120, 2025-05-07T20:31:49.0938469Z contiguous=True, 2025-05-07T20:31:49.0938693Z compiled=True, 2025-05-07T20:31:49.0938900Z ) 2025-05-07T20:31:49.0939114Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0939474Z self=, 2025-05-07T20:31:49.0939847Z T=4096, 2025-05-07T20:31:49.0940035Z D=7168, 2025-05-07T20:31:49.0940227Z contiguous=False, 2025-05-07T20:31:49.0940448Z compiled=False, 2025-05-07T20:31:49.0940652Z ) 2025-05-07T20:31:49.0940851Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0941212Z self=, 2025-05-07T20:31:49.0941585Z T=4096, 2025-05-07T20:31:49.0941781Z D=5120, 2025-05-07T20:31:49.0941972Z contiguous=False, 2025-05-07T20:31:49.0942195Z compiled=True, 2025-05-07T20:31:49.0942396Z ) 2025-05-07T20:31:49.0942594Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0942959Z self=, 2025-05-07T20:31:49.0943332Z T=1, 2025-05-07T20:31:49.0943509Z D=7168, 2025-05-07T20:31:49.0943708Z contiguous=True, 2025-05-07T20:31:49.0943941Z compiled=True, 2025-05-07T20:31:49.0944141Z ) 2025-05-07T20:31:49.0944337Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0944699Z self=, 2025-05-07T20:31:49.0945068Z T=1, 2025-05-07T20:31:49.0945256Z D=7168, 2025-05-07T20:31:49.0945454Z contiguous=False, 2025-05-07T20:31:49.0945678Z compiled=True, 2025-05-07T20:31:49.0945877Z ) 2025-05-07T20:31:49.0946081Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0946448Z self=, 2025-05-07T20:31:49.0946811Z T=4096, 2025-05-07T20:31:49.0947002Z D=5120, 2025-05-07T20:31:49.0947198Z contiguous=False, 2025-05-07T20:31:49.0947422Z compiled=False, 2025-05-07T20:31:49.0947629Z ) 2025-05-07T20:31:49.0947828Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0948189Z self=, 2025-05-07T20:31:49.0948571Z T=1, 2025-05-07T20:31:49.0948760Z D=7168, 2025-05-07T20:31:49.0948948Z contiguous=True, 2025-05-07T20:31:49.0949170Z compiled=False, 2025-05-07T20:31:49.0949375Z ) 2025-05-07T20:31:49.0949567Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0949933Z self=, 2025-05-07T20:31:49.0950305Z T=2048, 2025-05-07T20:31:49.0950482Z D=5120, 2025-05-07T20:31:49.0950674Z contiguous=True, 2025-05-07T20:31:49.0950897Z compiled=True, 2025-05-07T20:31:49.0951089Z ) 2025-05-07T20:31:49.0951282Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0951645Z self=, 2025-05-07T20:31:49.0952014Z T=2048, 2025-05-07T20:31:49.0952197Z D=7168, 2025-05-07T20:31:49.0952396Z contiguous=True, 2025-05-07T20:31:49.0952617Z compiled=True, 2025-05-07T20:31:49.0952810Z ) 2025-05-07T20:31:49.0953006Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0953371Z self=, 2025-05-07T20:31:49.0953918Z T=2048, 2025-05-07T20:31:49.0954106Z D=7168, 2025-05-07T20:31:49.0954297Z contiguous=True, 2025-05-07T20:31:49.0954516Z compiled=False, 2025-05-07T20:31:49.0954722Z ) 2025-05-07T20:31:49.0954921Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0955279Z self=, 2025-05-07T20:31:49.0955651Z T=128, 2025-05-07T20:31:49.0955835Z D=5120, 2025-05-07T20:31:49.0956026Z contiguous=False, 2025-05-07T20:31:49.0956249Z 
compiled=True, 2025-05-07T20:31:49.0956452Z ) 2025-05-07T20:31:49.0956643Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0957109Z self=, 2025-05-07T20:31:49.0957484Z T=128, 2025-05-07T20:31:49.0957669Z D=5120, 2025-05-07T20:31:49.0957865Z contiguous=True, 2025-05-07T20:31:49.0958089Z compiled=True, 2025-05-07T20:31:49.0958298Z ) 2025-05-07T20:31:49.0958490Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0958857Z self=, 2025-05-07T20:31:49.0959230Z T=16384, 2025-05-07T20:31:49.0959419Z D=5120, 2025-05-07T20:31:49.0959618Z contiguous=False, 2025-05-07T20:31:49.0959845Z compiled=True, 2025-05-07T20:31:49.0960041Z ) 2025-05-07T20:31:49.0960238Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0960606Z self=, 2025-05-07T20:31:49.0960973Z T=16384, 2025-05-07T20:31:49.0961165Z D=5120, 2025-05-07T20:31:49.0961364Z contiguous=False, 2025-05-07T20:31:49.0961580Z compiled=False, 2025-05-07T20:31:49.0961785Z ) 2025-05-07T20:31:49.0961988Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0962345Z self=, 2025-05-07T20:31:49.0962719Z T=128, 2025-05-07T20:31:49.0962907Z D=7168, 2025-05-07T20:31:49.0963105Z contiguous=True, 2025-05-07T20:31:49.0963324Z compiled=False, 2025-05-07T20:31:49.0963525Z ) 2025-05-07T20:31:49.0963733Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0964092Z self=, 2025-05-07T20:31:49.0964464Z T=128, 2025-05-07T20:31:49.0964653Z D=7168, 2025-05-07T20:31:49.0964852Z contiguous=False, 2025-05-07T20:31:49.0965072Z compiled=False, 2025-05-07T20:31:49.0965283Z ) 2025-05-07T20:31:49.0965484Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0965841Z self=, 2025-05-07T20:31:49.0966218Z T=1, 2025-05-07T20:31:49.0966403Z D=5120, 2025-05-07T20:31:49.0966595Z contiguous=False, 2025-05-07T20:31:49.0966831Z compiled=False, 2025-05-07T20:31:49.0967037Z ) 2025-05-07T20:31:49.0967226Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0967594Z self=, 2025-05-07T20:31:49.0967970Z T=1, 2025-05-07T20:31:49.0968149Z D=7168, 2025-05-07T20:31:49.0968340Z contiguous=False, 2025-05-07T20:31:49.0968564Z compiled=False, 2025-05-07T20:31:49.0968763Z ) 2025-05-07T20:31:49.0968960Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0969331Z self=, 2025-05-07T20:31:49.0969694Z T=4096, 2025-05-07T20:31:49.0969883Z D=5120, 2025-05-07T20:31:49.0970080Z contiguous=True, 2025-05-07T20:31:49.0970294Z compiled=False, 2025-05-07T20:31:49.0970501Z ) 2025-05-07T20:31:49.0970696Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0971061Z self=, 2025-05-07T20:31:49.0971427Z T=128, 2025-05-07T20:31:49.0971624Z D=7168, 2025-05-07T20:31:49.0971825Z contiguous=True, 2025-05-07T20:31:49.0972039Z compiled=True, 2025-05-07T20:31:49.0972245Z ) 2025-05-07T20:31:49.0972444Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0972900Z self=, 2025-05-07T20:31:49.0973272Z T=1, 2025-05-07T20:31:49.0973482Z D=5120, 2025-05-07T20:31:49.0973837Z contiguous=False, 2025-05-07T20:31:49.0974077Z compiled=True, 2025-05-07T20:31:49.0974283Z ) 2025-05-07T20:31:49.0974474Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0974845Z self=, 2025-05-07T20:31:49.0975216Z T=4096, 2025-05-07T20:31:49.0975396Z D=7168, 2025-05-07T20:31:49.0975592Z contiguous=True, 2025-05-07T20:31:49.0975813Z compiled=False, 2025-05-07T20:31:49.0976012Z ) 2025-05-07T20:31:49.0976213Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0976689Z self=, 2025-05-07T20:31:49.0977060Z T=4096, 2025-05-07T20:31:49.0977252Z D=7168, 2025-05-07T20:31:49.0977449Z contiguous=False, 2025-05-07T20:31:49.0977676Z compiled=True, 2025-05-07T20:31:49.0977882Z ) 
2025-05-07T20:31:49.0978081Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0978444Z self=, 2025-05-07T20:31:49.0978812Z T=128, 2025-05-07T20:31:49.0979003Z D=5120, 2025-05-07T20:31:49.0979199Z contiguous=True, 2025-05-07T20:31:49.0979418Z compiled=False, 2025-05-07T20:31:49.0979624Z ) 2025-05-07T20:31:49.0979822Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0980181Z self=, 2025-05-07T20:31:49.0980554Z T=128, 2025-05-07T20:31:49.0980741Z D=5120, 2025-05-07T20:31:49.0980930Z contiguous=False, 2025-05-07T20:31:49.0981153Z compiled=False, 2025-05-07T20:31:49.0981361Z ) 2025-05-07T20:31:49.0981558Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0981921Z self=, 2025-05-07T20:31:49.0982292Z T=1, 2025-05-07T20:31:49.0982470Z D=5120, 2025-05-07T20:31:49.0982674Z contiguous=True, 2025-05-07T20:31:49.0982895Z compiled=False, 2025-05-07T20:31:49.0983094Z ) 2025-05-07T20:31:49.0983295Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0983659Z self=, 2025-05-07T20:31:49.0984028Z T=2048, 2025-05-07T20:31:49.0984212Z D=7168, 2025-05-07T20:31:49.0984409Z contiguous=False, 2025-05-07T20:31:49.0984635Z compiled=True, 2025-05-07T20:31:49.0984829Z ) 2025-05-07T20:31:49.0985026Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0985389Z self=, 2025-05-07T20:31:49.0985750Z T=2048, 2025-05-07T20:31:49.0985938Z D=7168, 2025-05-07T20:31:49.0986130Z contiguous=False, 2025-05-07T20:31:49.0986352Z compiled=False, 2025-05-07T20:31:49.0986559Z ) 2025-05-07T20:31:49.0986759Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0987118Z self=, 2025-05-07T20:31:49.0987493Z T=16384, 2025-05-07T20:31:49.0987690Z D=7168, 2025-05-07T20:31:49.0987879Z contiguous=False, 2025-05-07T20:31:49.0988106Z compiled=True, 2025-05-07T20:31:49.0988310Z ) 2025-05-07T20:31:49.0988503Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0988873Z self=, 2025-05-07T20:31:49.0989246Z T=16384, 2025-05-07T20:31:49.0989437Z D=7168, 2025-05-07T20:31:49.0989626Z contiguous=True, 2025-05-07T20:31:49.0989850Z compiled=True, 2025-05-07T20:31:49.0990054Z ) 2025-05-07T20:31:49.0990247Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0990618Z self=, 2025-05-07T20:31:49.0990990Z T=4096, 2025-05-07T20:31:49.0991177Z D=7168, 2025-05-07T20:31:49.0991370Z contiguous=True, 2025-05-07T20:31:49.0991592Z compiled=True, 2025-05-07T20:31:49.0991787Z ) 2025-05-07T20:31:49.0991987Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0992448Z self=, 2025-05-07T20:31:49.0992811Z T=2048, 2025-05-07T20:31:49.0993002Z D=5120, 2025-05-07T20:31:49.0993198Z contiguous=False, 2025-05-07T20:31:49.0993418Z compiled=False, 2025-05-07T20:31:49.0993626Z ) 2025-05-07T20:31:49.0993823Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0994181Z self=, 2025-05-07T20:31:49.0994549Z T=2048, 2025-05-07T20:31:49.0994739Z D=5120, 2025-05-07T20:31:49.0994928Z contiguous=True, 2025-05-07T20:31:49.0995148Z compiled=False, 2025-05-07T20:31:49.0995355Z ) 2025-05-07T20:31:49.0995544Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0996005Z self=, 2025-05-07T20:31:49.0996380Z T=128, 2025-05-07T20:31:49.0996571Z D=7168, 2025-05-07T20:31:49.0996757Z contiguous=False, 2025-05-07T20:31:49.0996979Z compiled=True, 2025-05-07T20:31:49.0997188Z ) 2025-05-07T20:31:49.0997375Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0997740Z self=, 2025-05-07T20:31:49.0998108Z T=16384, 2025-05-07T20:31:49.0998752Z D=5120, 2025-05-07T20:31:49.0998950Z contiguous=True, 2025-05-07T20:31:49.0999173Z compiled=True, 2025-05-07T20:31:49.0999369Z ) 2025-05-07T20:31:49.0999568Z Trying example: 
test_silu_mul( 2025-05-07T20:31:49.0999933Z self=, 2025-05-07T20:31:49.1000301Z T=2048, 2025-05-07T20:31:49.1000489Z D=5120, 2025-05-07T20:31:49.1000699Z contiguous=False, 2025-05-07T20:31:49.1000925Z compiled=True, 2025-05-07T20:31:49.1001132Z ) 2025-05-07T20:31:49.1001331Z Trying example: test_silu_mul( 2025-05-07T20:31:49.1001705Z self=, 2025-05-07T20:31:49.1002082Z T=16384, 2025-05-07T20:31:49.1002271Z D=5120, 2025-05-07T20:31:49.1002475Z contiguous=True, 2025-05-07T20:31:49.1002699Z compiled=False, 2025-05-07T20:31:49.1002902Z ) 2025-05-07T20:31:49.1011800Z Trying example: test_silu_mul( 2025-05-07T20:31:49.1012219Z self=, 2025-05-07T20:31:49.1012620Z T=16384, 2025-05-07T20:31:49.1012823Z D=7168, 2025-05-07T20:31:49.1013030Z contiguous=False, 2025-05-07T20:31:49.1013256Z compiled=False, 2025-05-07T20:31:49.1013469Z ) 2025-05-07T20:31:49.1013779Z Trying example: test_silu_mul( 2025-05-07T20:31:49.1014149Z self=, 2025-05-07T20:31:49.1014530Z T=16384, 2025-05-07T20:31:49.1014728Z D=7168, 2025-05-07T20:31:49.1014930Z contiguous=True, 2025-05-07T20:31:49.1015157Z compiled=False, 2025-05-07T20:31:49.1015366Z ) 2025-05-07T20:31:49.1015554Z PASSED 2025-05-07T20:31:49.1633808Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:49.1634948Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:49.1636292Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:49.1637732Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:49.1638718Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.1640021Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:49.1641786Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.1643174Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:49.1644718Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.1645768Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] module_map=module_map) 2025-05-07T20:31:49.1647014Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, 
in ast_to_ttir 2025-05-07T20:31:49.1648249Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:49.1649093Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:49.1650290Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:49.1651483Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:49.1652517Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:49.1653535Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:49.1654845Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:49.1656124Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:49.1657019Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:49.1658099Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:49.1659135Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:49.1659907Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:49.1661073Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:49.1662402Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:49.1663551Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.1664453Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.1665196Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:49.1666212Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.6290104Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.6291903Z self=, 2025-05-07T20:31:49.6292434Z T=1, 2025-05-07T20:31:49.6292626Z D=5120, 2025-05-07T20:31:49.6292849Z scale_ub=None, 2025-05-07T20:31:49.6293149Z contiguous=True, 2025-05-07T20:31:49.6293460Z compiled=True, 2025-05-07T20:31:49.6293861Z ) 2025-05-07T20:31:49.6294324Z self = 2025-05-07T20:31:49.6295015Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.6295276Z 2025-05-07T20:31:49.6295360Z @given( 2025-05-07T20:31:49.6295598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.6295919Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.6296235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.6296569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.6296897Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.6297188Z ) 2025-05-07T20:31:49.6297568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.6298007Z def test_silu_mul_quant( 2025-05-07T20:31:49.6298561Z self, 2025-05-07T20:31:49.6298769Z T: int, 2025-05-07T20:31:49.6298965Z D: int, 2025-05-07T20:31:49.6299192Z scale_ub: Optional[float], 2025-05-07T20:31:49.6299471Z contiguous: bool, 2025-05-07T20:31:49.6299721Z compiled: bool, 2025-05-07T20:31:49.6299963Z ) -> None: 2025-05-07T20:31:49.6300189Z torch.manual_seed(2025) 2025-05-07T20:31:49.6300434Z 2025-05-07T20:31:49.6300704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.6301453Z 2025-05-07T20:31:49.6301657Z x_sign = torch.sign(x) 2025-05-07T20:31:49.6301947Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.6302263Z x = x_sign * x_clamp 2025-05-07T20:31:49.6302507Z x0 = x[:, :D] 2025-05-07T20:31:49.6302722Z x1 = x[:, D:] 2025-05-07T20:31:49.6302936Z 2025-05-07T20:31:49.6303128Z if contiguous: 2025-05-07T20:31:49.6303357Z x0 = x0.contiguous() 2025-05-07T20:31:49.6303620Z x1 = x1.contiguous() 2025-05-07T20:31:49.6303864Z 2025-05-07T20:31:49.6304051Z if scale_ub is not None: 2025-05-07T20:31:49.6304329Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.6304804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.6305121Z ) 2025-05-07T20:31:49.6305313Z else: 2025-05-07T20:31:49.6305527Z scale_ub_tensor = None 2025-05-07T20:31:49.6305781Z 2025-05-07T20:31:49.6306009Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.6306338Z op = silu_mul_quant 2025-05-07T20:31:49.6306593Z if compiled: 2025-05-07T20:31:49.6306835Z op = torch.compile(op) 2025-05-07T20:31:49.6307136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.6307412Z 2025-05-07T20:31:49.6307602Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.6307886Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.6308180Z 2025-05-07T20:31:49.6308412Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.6308749Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.6309039Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.6309356Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.6309716Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.6310033Z 2025-05-07T20:31:49.6310238Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.6310439Z 2025-05-07T20:31:49.6310540Z moe/activation_test.py:126: 2025-05-07T20:31:49.6310839Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.6311180Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.6311501Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.6312287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.6313038Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.6313582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.6314261Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.6314945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.6315664Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.6316383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.6317018Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.6317618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.6318129Z fn() 2025-05-07T20:31:49.6318633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.6319213Z self.fn.run( 2025-05-07T20:31:49.6319677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.6320211Z kernel = self.compile( 2025-05-07T20:31:49.6320753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.6321494Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.6321889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.6322125Z 2025-05-07T20:31:49.6322331Z self = 2025-05-07T20:31:49.6323405Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.6324866Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93bd68d3a0>} 2025-05-07T20:31:49.6326199Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.6327211Z context = 2025-05-07T20:31:49.6327503Z 2025-05-07T20:31:49.6327672Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.6328189Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.6328653Z module_map=module_map) 2025-05-07T20:31:49.6329019Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.6329383Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.6329651Z E ^ 2025-05-07T20:31:49.6330110Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.6330558Z 2025-05-07T20:31:49.6330967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.6331484Z 2025-05-07T20:31:49.6331586Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.6331999Z self=, 2025-05-07T20:31:49.6332392Z T=2048, 2025-05-07T20:31:49.6332581Z D=5120, 2025-05-07T20:31:49.6332775Z scale_ub=1200.0, 2025-05-07T20:31:49.6332993Z contiguous=True, 2025-05-07T20:31:49.6333218Z compiled=False, 2025-05-07T20:31:49.6333426Z ) 2025-05-07T20:31:49.6333820Z self = 2025-05-07T20:31:49.6334313Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.6334581Z 2025-05-07T20:31:49.6334666Z @given( 2025-05-07T20:31:49.6334905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.6335210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.6335517Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.6335848Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.6336176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.6336466Z ) 2025-05-07T20:31:49.6336815Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.6337246Z def test_silu_mul_quant( 2025-05-07T20:31:49.6337487Z self, 2025-05-07T20:31:49.6337683Z T: int, 2025-05-07T20:31:49.6337876Z D: int, 2025-05-07T20:31:49.6338096Z scale_ub: Optional[float], 2025-05-07T20:31:49.6338371Z contiguous: bool, 2025-05-07T20:31:49.6338609Z compiled: bool, 2025-05-07T20:31:49.6338832Z ) -> None: 2025-05-07T20:31:49.6339047Z torch.manual_seed(2025) 2025-05-07T20:31:49.6339284Z 2025-05-07T20:31:49.6339563Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.6339908Z 2025-05-07T20:31:49.6340102Z x_sign = torch.sign(x) 2025-05-07T20:31:49.6340387Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.6340790Z x = x_sign * x_clamp 2025-05-07T20:31:49.6341037Z x0 = x[:, :D] 2025-05-07T20:31:49.6341251Z x1 = x[:, D:] 2025-05-07T20:31:49.6341461Z 2025-05-07T20:31:49.6341652Z if contiguous: 2025-05-07T20:31:49.6341881Z x0 = x0.contiguous() 2025-05-07T20:31:49.6342142Z x1 = x1.contiguous() 2025-05-07T20:31:49.6342386Z 2025-05-07T20:31:49.6342576Z if scale_ub is not None: 2025-05-07T20:31:49.6342852Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.6343182Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.6343480Z ) 2025-05-07T20:31:49.6343675Z else: 2025-05-07T20:31:49.6343968Z scale_ub_tensor = None 2025-05-07T20:31:49.6344215Z 2025-05-07T20:31:49.6344442Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.6344754Z op = silu_mul_quant 2025-05-07T20:31:49.6344999Z if compiled: 2025-05-07T20:31:49.6345247Z op = torch.compile(op) 2025-05-07T20:31:49.6345545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.6345817Z 2025-05-07T20:31:49.6346003Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.6346172Z 2025-05-07T20:31:49.6346269Z moe/activation_test.py:117: 2025-05-07T20:31:49.6346569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.6346892Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.6347177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.6347856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.6348549Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.6349076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.6349749Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.6350410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.6350934Z kernel = self.compile( 2025-05-07T20:31:49.6351471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.6352154Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.6352573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.6352799Z 2025-05-07T20:31:49.6353004Z self = 2025-05-07T20:31:49.6354072Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.6355417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93ce57e200>} 2025-05-07T20:31:49.6356745Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.6357752Z context = 2025-05-07T20:31:49.6358035Z 2025-05-07T20:31:49.6358201Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.6358718Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.6359189Z module_map=module_map) 2025-05-07T20:31:49.6359545Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.6359897Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.6360155Z E ^ 2025-05-07T20:31:49.6360704Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.6361146Z 2025-05-07T20:31:49.6361554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9001743Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:49.9003682Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:49.9007901Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:49.9010818Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:49.9012358Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.9013731Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:49.9015097Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9016379Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:49.9017735Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9018762Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] module_map=module_map) 2025-05-07T20:31:49.9019993Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:49.9021224Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:49.9022065Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:49.9023298Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:49.9024486Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:49.9025494Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:49.9026500Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:49.9027694Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:49.9029129Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:49.9030011Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:49.9031073Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:49.9032175Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:49.9032940Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:49.9034089Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:49.9035411Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:49.9036457Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9037349Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9038087Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:49.9039084Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:50.4952294Z 2025-05-07T20:31:50.4952847Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:50.4953341Z self=, 2025-05-07T20:31:50.4953790Z T=2048, 2025-05-07T20:31:50.4953987Z D=5120, 2025-05-07T20:31:50.4954179Z scale_ub=1200.0, 2025-05-07T20:31:50.4954441Z contiguous=True, 2025-05-07T20:31:50.4954671Z compiled=True, 2025-05-07T20:31:50.4955086Z ) 2025-05-07T20:31:50.4955405Z self = 2025-05-07T20:31:50.4955897Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:50.4956165Z 2025-05-07T20:31:50.4956248Z @given( 2025-05-07T20:31:50.4956472Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:50.4956789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:50.4957093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:50.4957412Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:50.4957741Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:50.4958028Z ) 2025-05-07T20:31:50.4958510Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:50.4958959Z def test_silu_mul_quant( 2025-05-07T20:31:50.4959202Z self, 2025-05-07T20:31:50.4959399Z T: int, 2025-05-07T20:31:50.4959591Z D: int, 2025-05-07T20:31:50.4959819Z scale_ub: Optional[float], 2025-05-07T20:31:50.4960094Z contiguous: bool, 2025-05-07T20:31:50.4960329Z compiled: bool, 2025-05-07T20:31:50.4960553Z ) -> None: 2025-05-07T20:31:50.4960775Z torch.manual_seed(2025) 2025-05-07T20:31:50.4961008Z 2025-05-07T20:31:50.4961273Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:50.4961619Z 2025-05-07T20:31:50.4961811Z x_sign = torch.sign(x) 2025-05-07T20:31:50.4962099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:50.4962410Z x = x_sign * x_clamp 2025-05-07T20:31:50.4962648Z x0 = x[:, :D] 2025-05-07T20:31:50.4962866Z x1 = x[:, D:] 2025-05-07T20:31:50.4963075Z 2025-05-07T20:31:50.4963264Z if contiguous: 2025-05-07T20:31:50.4963495Z x0 = x0.contiguous() 2025-05-07T20:31:50.4963755Z x1 = x1.contiguous() 2025-05-07T20:31:50.4963986Z 2025-05-07T20:31:50.4964186Z if scale_ub is not None: 2025-05-07T20:31:50.4964460Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:50.4964796Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:50.4965099Z ) 2025-05-07T20:31:50.4965298Z else: 2025-05-07T20:31:50.4965518Z scale_ub_tensor = None 2025-05-07T20:31:50.4965761Z 2025-05-07T20:31:50.4965991Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.4966307Z op = silu_mul_quant 2025-05-07T20:31:50.4966548Z if compiled: 2025-05-07T20:31:50.4966796Z op = torch.compile(op) 2025-05-07T20:31:50.4967094Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:50.4967362Z 2025-05-07T20:31:50.4967562Z y_fp8, y_scale = fn() 2025-05-07T20:31:50.4967845Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:50.4968128Z 2025-05-07T20:31:50.4968374Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.4968716Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:50.4969006Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:50.4969307Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:50.4969659Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:50.4969966Z 2025-05-07T20:31:50.4970160Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:50.4970356Z 2025-05-07T20:31:50.4970457Z moe/activation_test.py:126: 2025-05-07T20:31:50.4970751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
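In this example (compiled=True), fn() itself got past compilation, but the reference path fails next: ref_fn() calls triton_quantize_fp8_row, whose _kernel_quantize_fp8_row is also a Triton kernel that emits fp8e4nv, so it hits the same ValueError, as the traceback below shows. For anyone reproducing this locally, a pure-PyTorch rowwise quantizer can stand in for the reference path; the following is a minimal sketch assuming torch.float8_e4m3fn is available in the installed PyTorch (2.1+). It is not FBGEMM's triton_quantize_fp8_row implementation, and the function name is hypothetical:

import torch

def quantize_fp8_row_reference(
    y: torch.Tensor, scale_ub: torch.Tensor | None = None
) -> tuple[torch.Tensor, torch.Tensor]:
    # Hypothetical stand-in for triton_quantize_fp8_row, written in pure
    # PyTorch so it does not depend on Triton fp8e4nv codegen.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        # Mirror the test's scale_ub tensor: cap the per-row max.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    row_max = torch.clamp(row_max, min=1e-12)  # avoid divide-by-zero rows
    y_scale = row_max / fp8_max
    y_fp8 = (y.to(torch.float32) / y_scale).clamp(-fp8_max, fp8_max)
    # The dtype cast is a software conversion, so it also runs on GPUs
    # without native fp8 support (unlike the Triton kernel in the log).
    return y_fp8.to(torch.float8_e4m3fn), y_scale.squeeze(-1)

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None], exactly as the test above does, approximately recovers y under this scheme.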
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.4971079Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:50.4971402Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:50.4972180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:50.4972974Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:50.4973701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:50.4974379Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:50.4975058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:50.4975772Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:50.4976480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:50.4977111Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:50.4977792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:50.4978305Z fn() 2025-05-07T20:31:50.4978808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:50.4979391Z self.fn.run( 2025-05-07T20:31:50.4979854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:50.4980372Z kernel = self.compile( 2025-05-07T20:31:50.4980912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:50.4981564Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:50.4981954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.4982186Z 2025-05-07T20:31:50.4982396Z self = 2025-05-07T20:31:50.4983516Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:50.4984864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93ce4bede0>} 2025-05-07T20:31:50.4986191Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:50.4987190Z context = 2025-05-07T20:31:50.4987482Z 2025-05-07T20:31:50.4987648Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:50.4988169Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:50.4988634Z module_map=module_map) 2025-05-07T20:31:50.4988996Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.4989355Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:50.4989628Z E ^ 2025-05-07T20:31:50.4990078Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:50.4990524Z 2025-05-07T20:31:50.4990933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:50.4991444Z 2025-05-07T20:31:50.4991548Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:50.4991958Z self=, 2025-05-07T20:31:50.4992355Z T=16384, 2025-05-07T20:31:50.4992550Z D=7168, 2025-05-07T20:31:50.4992768Z scale_ub=1200.0, 2025-05-07T20:31:50.4993000Z contiguous=False, 2025-05-07T20:31:50.4993217Z compiled=False, 2025-05-07T20:31:50.4993423Z ) 2025-05-07T20:31:50.4993736Z self = 2025-05-07T20:31:50.4994223Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:50.4994588Z 2025-05-07T20:31:50.4994664Z @given( 2025-05-07T20:31:50.4994894Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:50.4995205Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:50.4995503Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:50.4995831Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:50.4996154Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:50.4996436Z ) 2025-05-07T20:31:50.4996783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:50.4997220Z def test_silu_mul_quant( 2025-05-07T20:31:50.4997456Z self, 2025-05-07T20:31:50.4997734Z T: int, 2025-05-07T20:31:50.4997935Z D: int, 2025-05-07T20:31:50.4998147Z scale_ub: Optional[float], 2025-05-07T20:31:50.4998581Z contiguous: bool, 2025-05-07T20:31:50.4998827Z compiled: bool, 2025-05-07T20:31:50.4999043Z ) -> None: 2025-05-07T20:31:50.4999256Z torch.manual_seed(2025) 2025-05-07T20:31:50.4999498Z 2025-05-07T20:31:50.4999761Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:50.5000105Z 2025-05-07T20:31:50.5000297Z x_sign = torch.sign(x) 2025-05-07T20:31:50.5000589Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:50.5000895Z x = x_sign * x_clamp 2025-05-07T20:31:50.5001131Z x0 = x[:, :D] 2025-05-07T20:31:50.5001344Z x1 = x[:, D:] 2025-05-07T20:31:50.5001547Z 2025-05-07T20:31:50.5001736Z if contiguous: 2025-05-07T20:31:50.5001971Z x0 = x0.contiguous() 2025-05-07T20:31:50.5002239Z x1 = x1.contiguous() 2025-05-07T20:31:50.5002518Z 2025-05-07T20:31:50.5002723Z if scale_ub is not None: 2025-05-07T20:31:50.5002998Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:50.5003334Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:50.5003637Z ) 2025-05-07T20:31:50.5003830Z else: 2025-05-07T20:31:50.5004043Z scale_ub_tensor = None 2025-05-07T20:31:50.5004286Z 2025-05-07T20:31:50.5004519Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.5004834Z op = silu_mul_quant 2025-05-07T20:31:50.5005079Z if compiled: 2025-05-07T20:31:50.5005324Z op = torch.compile(op) 2025-05-07T20:31:50.5005628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:50.5005897Z 2025-05-07T20:31:50.5006095Z > y_fp8, y_scale = fn() 2025-05-07T20:31:50.5006256Z 2025-05-07T20:31:50.5006358Z moe/activation_test.py:117: 2025-05-07T20:31:50.5006654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.5006980Z moe/activation_test.py:115: in fn 2025-05-07T20:31:50.5007264Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:50.5007950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:50.5008628Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:50.5009163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:50.5009839Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:50.5010491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:50.5011015Z kernel = self.compile( 2025-05-07T20:31:50.5011553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:50.5012206Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:50.5012631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.5013010Z 2025-05-07T20:31:50.5013216Z self = 2025-05-07T20:31:50.5014355Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:50.5015708Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a73cd440>} 2025-05-07T20:31:50.5017190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:50.5018194Z context = 2025-05-07T20:31:50.5018486Z 2025-05-07T20:31:50.5018650Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:50.5019169Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:50.5019631Z module_map=module_map) 2025-05-07T20:31:50.5019990Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.5020345Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:50.5020605Z E ^ 2025-05-07T20:31:50.5021059Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:50.5021504Z 2025-05-07T20:31:50.5021915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:50.6802679Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:50.6803809Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:50.6805490Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:50.6807258Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:50.6808232Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.6809521Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:50.6810886Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:50.6812162Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:50.6813517Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:50.6814918Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] module_map=module_map) 2025-05-07T20:31:50.6816492Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:50.6817874Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:50.6818710Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:50.6819888Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:50.6821192Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:50.6822217Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:50.6823275Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:50.6824480Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:50.6825735Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:50.6826633Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:50.6827707Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:50.6828739Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:50.6829500Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:50.6830645Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:50.6831980Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:50.6833028Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.6833938Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:50.6834667Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:50.6835672Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:51.5866636Z 2025-05-07T20:31:51.5867392Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:51.5867880Z self=, 2025-05-07T20:31:51.5868301Z T=1, 2025-05-07T20:31:51.5868488Z D=7168, 2025-05-07T20:31:51.5868688Z scale_ub=None, 2025-05-07T20:31:51.5868907Z contiguous=True, 2025-05-07T20:31:51.5869129Z compiled=True, 2025-05-07T20:31:51.5869341Z ) 2025-05-07T20:31:51.5869663Z self = 2025-05-07T20:31:51.5870144Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:51.5870408Z 2025-05-07T20:31:51.5870489Z @given( 2025-05-07T20:31:51.5870755Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:51.5871069Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:51.5871382Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:51.5871718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:51.5872063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:51.5872344Z ) 2025-05-07T20:31:51.5872722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:51.5873189Z def test_silu_mul_quant( 2025-05-07T20:31:51.5873434Z self, 2025-05-07T20:31:51.5873630Z T: int, 2025-05-07T20:31:51.5873859Z D: int, 2025-05-07T20:31:51.5874069Z scale_ub: Optional[float], 2025-05-07T20:31:51.5874336Z contiguous: bool, 2025-05-07T20:31:51.5874577Z compiled: bool, 2025-05-07T20:31:51.5874801Z ) -> None: 2025-05-07T20:31:51.5875021Z torch.manual_seed(2025) 2025-05-07T20:31:51.5875262Z 2025-05-07T20:31:51.5875529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:51.5875869Z 2025-05-07T20:31:51.5876057Z x_sign = torch.sign(x) 2025-05-07T20:31:51.5876337Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:51.5876647Z x = x_sign * x_clamp 2025-05-07T20:31:51.5876886Z x0 = x[:, :D] 2025-05-07T20:31:51.5877096Z x1 = x[:, D:] 2025-05-07T20:31:51.5877299Z 2025-05-07T20:31:51.5877484Z if contiguous: 2025-05-07T20:31:51.5877706Z x0 = x0.contiguous() 2025-05-07T20:31:51.5877961Z x1 = x1.contiguous() 2025-05-07T20:31:51.5878198Z 2025-05-07T20:31:51.5878382Z if scale_ub is not None: 2025-05-07T20:31:51.5878655Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:51.5878985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:51.5879288Z ) 2025-05-07T20:31:51.5879473Z else: 2025-05-07T20:31:51.5879683Z scale_ub_tensor = None 2025-05-07T20:31:51.5879933Z 2025-05-07T20:31:51.5880155Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:51.5880466Z op = silu_mul_quant 2025-05-07T20:31:51.5880712Z if compiled: 2025-05-07T20:31:51.5881239Z op = torch.compile(op) 2025-05-07T20:31:51.5881533Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:51.5881808Z 2025-05-07T20:31:51.5881992Z y_fp8, y_scale = fn() 2025-05-07T20:31:51.5882270Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:51.5882558Z 2025-05-07T20:31:51.5882785Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:51.5883119Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:51.5883408Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:51.5883717Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:51.5884067Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:51.5884377Z 2025-05-07T20:31:51.5884728Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:51.5884921Z 2025-05-07T20:31:51.5885023Z moe/activation_test.py:126: 2025-05-07T20:31:51.5885321Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:51.5885664Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:51.5885983Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:51.5886767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:51.5887509Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:51.5888051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:51.5888719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:51.5889402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:51.5890120Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:51.5890834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:51.5891459Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:51.5892057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:51.5892704Z fn() 2025-05-07T20:31:51.5893206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:51.5893962Z self.fn.run( 2025-05-07T20:31:51.5894483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:51.5895014Z kernel = self.compile( 2025-05-07T20:31:51.5895551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:51.5896214Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:51.5896609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:51.5896844Z 2025-05-07T20:31:51.5897054Z self = 2025-05-07T20:31:51.5898123Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:51.5899976Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a73cf920>} 2025-05-07T20:31:51.5901357Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:51.5902369Z context = 2025-05-07T20:31:51.5902856Z 2025-05-07T20:31:51.5903021Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:51.5903536Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:51.5904001Z module_map=module_map) 2025-05-07T20:31:51.5904372Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:51.5904719Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:51.5904982Z E ^ 2025-05-07T20:31:51.5905429Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:51.5905864Z 2025-05-07T20:31:51.5906388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:51.5906898Z 2025-05-07T20:31:51.5906999Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:51.5907403Z self=, 2025-05-07T20:31:51.5907805Z T=4096, 2025-05-07T20:31:51.5907986Z D=5120, 2025-05-07T20:31:51.5908177Z scale_ub=None, 2025-05-07T20:31:51.5908393Z contiguous=False, 2025-05-07T20:31:51.5908610Z compiled=False, 2025-05-07T20:31:51.5908814Z ) 2025-05-07T20:31:51.5909126Z self = 2025-05-07T20:31:51.5909607Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:51.5909885Z 2025-05-07T20:31:51.5909962Z @given( 2025-05-07T20:31:51.5910191Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:51.5910502Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:51.5910800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:51.5911129Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:51.5911453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:51.5911731Z ) 2025-05-07T20:31:51.5912078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:51.5912528Z def test_silu_mul_quant( 2025-05-07T20:31:51.5912759Z self, 2025-05-07T20:31:51.5912956Z T: int, 2025-05-07T20:31:51.5913154Z D: int, 2025-05-07T20:31:51.5913364Z scale_ub: Optional[float], 2025-05-07T20:31:51.5913638Z contiguous: bool, 2025-05-07T20:31:51.5913877Z compiled: bool, 2025-05-07T20:31:51.5914092Z ) -> None: 2025-05-07T20:31:51.5914306Z torch.manual_seed(2025) 2025-05-07T20:31:51.5914544Z 2025-05-07T20:31:51.5914804Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:51.5915146Z 2025-05-07T20:31:51.5915339Z x_sign = torch.sign(x) 2025-05-07T20:31:51.5915629Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:51.5915929Z x = x_sign * x_clamp 2025-05-07T20:31:51.5916160Z x0 = x[:, :D] 2025-05-07T20:31:51.5916370Z x1 = x[:, D:] 2025-05-07T20:31:51.5916572Z 2025-05-07T20:31:51.5916753Z if contiguous: 2025-05-07T20:31:51.5916977Z x0 = x0.contiguous() 2025-05-07T20:31:51.5917227Z x1 = x1.contiguous() 2025-05-07T20:31:51.5917463Z 2025-05-07T20:31:51.5917652Z if scale_ub is not None: 2025-05-07T20:31:51.5917916Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:51.5918249Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:51.5918556Z ) 2025-05-07T20:31:51.5918739Z else: 2025-05-07T20:31:51.5918949Z scale_ub_tensor = None 2025-05-07T20:31:51.5919201Z 2025-05-07T20:31:51.5919422Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:51.5919737Z op = silu_mul_quant 2025-05-07T20:31:51.5919986Z if compiled: 2025-05-07T20:31:51.5920231Z op = torch.compile(op) 2025-05-07T20:31:51.5920523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:51.5920799Z 2025-05-07T20:31:51.5921084Z > y_fp8, y_scale = fn() 2025-05-07T20:31:51.5921244Z 2025-05-07T20:31:51.5921340Z moe/activation_test.py:117: 2025-05-07T20:31:51.5921632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:51.5921964Z moe/activation_test.py:115: in fn 2025-05-07T20:31:51.5922238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:51.5922923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:51.5923650Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:51.5924186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:51.5924934Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:51.5925597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:51.5926135Z kernel = self.compile( 2025-05-07T20:31:51.5926662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:51.5927312Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:51.5927708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:51.5927936Z 2025-05-07T20:31:51.5928152Z self = 2025-05-07T20:31:51.5929218Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:51.5930570Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a7554cc0>} 2025-05-07T20:31:51.5931900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:51.5932909Z context = 2025-05-07T20:31:51.5933191Z 2025-05-07T20:31:51.5933366Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:51.5933963Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:51.5934422Z module_map=module_map) 2025-05-07T20:31:51.5934783Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:51.5935130Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:51.5935388Z E ^ 2025-05-07T20:31:51.5935841Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:51.5936283Z 2025-05-07T20:31:51.5936692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:51.8661142Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:51.8662233Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:51.8663653Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:51.8665169Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:51.8666409Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:51.8667692Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:51.8669050Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:51.8670479Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:51.8671828Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:51.8672857Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] module_map=module_map) 2025-05-07T20:31:51.8674092Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:51.8675319Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:51.8676159Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:51.8677338Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:51.8678529Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:51.8679548Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:51.8680550Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:51.8681756Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:51.8683065Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:51.8683955Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:51.8685023Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:51.8686047Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:51.8686813Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:51.8687955Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:51.8689372Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:51.8690417Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:51.8691319Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:51.8692054Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:51.8693128Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.4833144Z 2025-05-07T20:31:53.4834186Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.4835130Z self=, 2025-05-07T20:31:53.4835950Z T=4096, 2025-05-07T20:31:53.4836331Z D=7168, 2025-05-07T20:31:53.4836702Z scale_ub=None, 2025-05-07T20:31:53.4837127Z contiguous=False, 2025-05-07T20:31:53.4837603Z compiled=False, 2025-05-07T20:31:53.4838001Z ) 2025-05-07T20:31:53.4838628Z self = 2025-05-07T20:31:53.4839605Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:53.4840157Z 2025-05-07T20:31:53.4840323Z @given( 2025-05-07T20:31:53.4840772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.4841397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.4841996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.4842631Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.4843271Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.4843761Z ) 2025-05-07T20:31:53.4844151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.4844593Z def test_silu_mul_quant( 2025-05-07T20:31:53.4844839Z self, 2025-05-07T20:31:53.4845034Z T: int, 2025-05-07T20:31:53.4845234Z D: int, 2025-05-07T20:31:53.4845453Z scale_ub: Optional[float], 2025-05-07T20:31:53.4845718Z contiguous: bool, 2025-05-07T20:31:53.4845962Z compiled: bool, 2025-05-07T20:31:53.4846604Z ) -> None: 2025-05-07T20:31:53.4846838Z torch.manual_seed(2025) 2025-05-07T20:31:53.4847078Z 2025-05-07T20:31:53.4847351Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.4847702Z 2025-05-07T20:31:53.4847892Z x_sign = torch.sign(x) 2025-05-07T20:31:53.4848184Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.4848495Z x = x_sign * x_clamp 2025-05-07T20:31:53.4848738Z x0 = x[:, :D] 2025-05-07T20:31:53.4848950Z x1 = x[:, D:] 2025-05-07T20:31:53.4849157Z 2025-05-07T20:31:53.4849343Z if contiguous: 2025-05-07T20:31:53.4849564Z x0 = x0.contiguous() 2025-05-07T20:31:53.4849822Z x1 = x1.contiguous() 2025-05-07T20:31:53.4858411Z 2025-05-07T20:31:53.4858650Z if scale_ub is not None: 2025-05-07T20:31:53.4858944Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.4859296Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.4859608Z ) 2025-05-07T20:31:53.4859808Z else: 2025-05-07T20:31:53.4860025Z scale_ub_tensor = None 2025-05-07T20:31:53.4860273Z 2025-05-07T20:31:53.4860511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.4860833Z op = silu_mul_quant 2025-05-07T20:31:53.4861082Z if compiled: 2025-05-07T20:31:53.4861334Z op = torch.compile(op) 2025-05-07T20:31:53.4861632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.4861910Z 2025-05-07T20:31:53.4862100Z > y_fp8, y_scale = fn() 2025-05-07T20:31:53.4862271Z 2025-05-07T20:31:53.4862373Z moe/activation_test.py:117: 2025-05-07T20:31:53.4862683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.4863018Z moe/activation_test.py:115: in fn 2025-05-07T20:31:53.4863304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.4863997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:53.4864685Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:53.4865221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:31:53.4865901Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.4866587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.4867125Z kernel = self.compile( 2025-05-07T20:31:53.4867659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.4868321Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.4868720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.4868948Z 2025-05-07T20:31:53.4869165Z self = 2025-05-07T20:31:53.4870234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.4871602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a75568e0>} 2025-05-07T20:31:53.4872933Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.4873948Z context = 2025-05-07T20:31:53.4874231Z 2025-05-07T20:31:53.4874395Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.4875007Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.4875475Z module_map=module_map) 2025-05-07T20:31:53.4875843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.4876187Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.4876450Z E ^ 2025-05-07T20:31:53.4876910Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.4877352Z 2025-05-07T20:31:53.4877767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.4878273Z 2025-05-07T20:31:53.4878456Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.4878868Z self=, 2025-05-07T20:31:53.4879267Z T=128, 2025-05-07T20:31:53.4879454Z D=7168, 2025-05-07T20:31:53.4879655Z scale_ub=None, 2025-05-07T20:31:53.4879869Z contiguous=False, 2025-05-07T20:31:53.4880095Z compiled=True, 2025-05-07T20:31:53.4880298Z ) 2025-05-07T20:31:53.4880606Z self = 2025-05-07T20:31:53.4881094Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:53.4881358Z 2025-05-07T20:31:53.4881445Z @given( 2025-05-07T20:31:53.4881670Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.4881986Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.4882295Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.4882618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.4882958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.4883257Z ) 2025-05-07T20:31:53.4883645Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.4884083Z def test_silu_mul_quant( 2025-05-07T20:31:53.4884329Z self, 2025-05-07T20:31:53.4884528Z T: int, 2025-05-07T20:31:53.4884716Z D: int, 2025-05-07T20:31:53.4884935Z scale_ub: Optional[float], 2025-05-07T20:31:53.4885203Z contiguous: bool, 2025-05-07T20:31:53.4885435Z compiled: bool, 2025-05-07T20:31:53.4885661Z ) -> None: 2025-05-07T20:31:53.4885874Z torch.manual_seed(2025) 2025-05-07T20:31:53.4886108Z 2025-05-07T20:31:53.4886378Z x = 
torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.4886718Z 2025-05-07T20:31:53.4886906Z x_sign = torch.sign(x) 2025-05-07T20:31:53.4887195Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.4887508Z x = x_sign * x_clamp 2025-05-07T20:31:53.4887737Z x0 = x[:, :D] 2025-05-07T20:31:53.4887954Z x1 = x[:, D:] 2025-05-07T20:31:53.4888163Z 2025-05-07T20:31:53.4888348Z if contiguous: 2025-05-07T20:31:53.4888577Z x0 = x0.contiguous() 2025-05-07T20:31:53.4888835Z x1 = x1.contiguous() 2025-05-07T20:31:53.4889075Z 2025-05-07T20:31:53.4889261Z if scale_ub is not None: 2025-05-07T20:31:53.4889539Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.4889870Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.4890177Z ) 2025-05-07T20:31:53.4890376Z else: 2025-05-07T20:31:53.4890588Z scale_ub_tensor = None 2025-05-07T20:31:53.4890840Z 2025-05-07T20:31:53.4891069Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.4891381Z op = silu_mul_quant 2025-05-07T20:31:53.4891625Z if compiled: 2025-05-07T20:31:53.4891877Z op = torch.compile(op) 2025-05-07T20:31:53.4892173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.4892444Z 2025-05-07T20:31:53.4892640Z y_fp8, y_scale = fn() 2025-05-07T20:31:53.4892923Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:53.4893332Z 2025-05-07T20:31:53.4893590Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.4894040Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:53.4894333Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:53.4894639Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:53.4895003Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.4895315Z 2025-05-07T20:31:53.4895510Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:53.4895707Z 2025-05-07T20:31:53.4895807Z moe/activation_test.py:126: 2025-05-07T20:31:53.4896102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.4896521Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:53.4896842Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.4897624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:53.4898637Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:53.4899175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.4899851Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.4900533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:53.4901245Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:53.4901964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:53.4902602Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:53.4903198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:53.4903717Z fn() 2025-05-07T20:31:53.4904212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:53.4904784Z self.fn.run( 2025-05-07T20:31:53.4905246Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.4905765Z kernel = self.compile( 2025-05-07T20:31:53.4906302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.4906946Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.4907340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.4907572Z 2025-05-07T20:31:53.4907779Z self = 2025-05-07T20:31:53.4908853Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.4910206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a75558a0>} 2025-05-07T20:31:53.4911526Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.4912527Z context = 2025-05-07T20:31:53.4912811Z 2025-05-07T20:31:53.4912981Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.4913495Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.4914105Z module_map=module_map) 2025-05-07T20:31:53.4914464Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.4914819Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:53.4915083Z E ^ 2025-05-07T20:31:53.4915533Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.4915975Z 2025-05-07T20:31:53.4916385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.7294045Z 2025-05-07T20:31:53.7294319Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.7294758Z self=, 2025-05-07T20:31:53.7295397Z T=128, 2025-05-07T20:31:53.7295605Z D=7168, 2025-05-07T20:31:53.7295802Z scale_ub=None, 2025-05-07T20:31:53.7296031Z contiguous=False, 2025-05-07T20:31:53.7296266Z compiled=False, 2025-05-07T20:31:53.7296479Z ) 2025-05-07T20:31:53.7296806Z self = 2025-05-07T20:31:53.7297302Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:53.7297569Z 2025-05-07T20:31:53.7297660Z @given( 2025-05-07T20:31:53.7297893Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.7298368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.7298683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.7299009Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.7299341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.7299632Z ) 2025-05-07T20:31:53.7299982Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.7300426Z def test_silu_mul_quant( 2025-05-07T20:31:53.7300670Z self, 2025-05-07T20:31:53.7300865Z T: int, 2025-05-07T20:31:53.7301067Z D: int, 2025-05-07T20:31:53.7301296Z scale_ub: Optional[float], 2025-05-07T20:31:53.7301564Z contiguous: bool, 2025-05-07T20:31:53.7301811Z compiled: bool, 2025-05-07T20:31:53.7302045Z ) -> None: 2025-05-07T20:31:53.7302268Z torch.manual_seed(2025) 2025-05-07T20:31:53.7302506Z 2025-05-07T20:31:53.7302782Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.7303129Z 2025-05-07T20:31:53.7303323Z x_sign = torch.sign(x) 
2025-05-07T20:31:53.7303618Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.7303931Z x = x_sign * x_clamp 2025-05-07T20:31:53.7304169Z x0 = x[:, :D] 2025-05-07T20:31:53.7304390Z x1 = x[:, D:] 2025-05-07T20:31:53.7304600Z 2025-05-07T20:31:53.7304795Z if contiguous: 2025-05-07T20:31:53.7305031Z x0 = x0.contiguous() 2025-05-07T20:31:53.7305296Z x1 = x1.contiguous() 2025-05-07T20:31:53.7305537Z 2025-05-07T20:31:53.7305745Z if scale_ub is not None: 2025-05-07T20:31:53.7306024Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.7306356Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.7306669Z ) 2025-05-07T20:31:53.7306870Z else: 2025-05-07T20:31:53.7307089Z scale_ub_tensor = None 2025-05-07T20:31:53.7307339Z 2025-05-07T20:31:53.7307573Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.7307894Z op = silu_mul_quant 2025-05-07T20:31:53.7308147Z if compiled: 2025-05-07T20:31:53.7308404Z op = torch.compile(op) 2025-05-07T20:31:53.7308706Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.7308979Z 2025-05-07T20:31:53.7309184Z > y_fp8, y_scale = fn() 2025-05-07T20:31:53.7309349Z 2025-05-07T20:31:53.7309456Z moe/activation_test.py:117: 2025-05-07T20:31:53.7309749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.7310272Z moe/activation_test.py:115: in fn 2025-05-07T20:31:53.7310560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.7311248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:53.7311929Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:53.7312466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.7313150Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.7313854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.7314542Z kernel = self.compile( 2025-05-07T20:31:53.7315085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.7315738Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.7316140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.7316372Z 2025-05-07T20:31:53.7316578Z self = 2025-05-07T20:31:53.7317642Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.7318991Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f93a659e700>} 2025-05-07T20:31:53.7320315Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.7321326Z context = 2025-05-07T20:31:53.7321619Z 2025-05-07T20:31:53.7321785Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.7322305Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.7322769Z module_map=module_map) 2025-05-07T20:31:53.7323138Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.7323525Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.7323809Z E ^ 2025-05-07T20:31:53.7324268Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.7324718Z 2025-05-07T20:31:53.7325135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.7325642Z 2025-05-07T20:31:53.7325756Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.7326166Z self=, 2025-05-07T20:31:53.7326568Z T=4096, 2025-05-07T20:31:53.7326764Z D=5120, 2025-05-07T20:31:53.7326959Z scale_ub=1200.0, 2025-05-07T20:31:53.7327186Z contiguous=True, 2025-05-07T20:31:53.7327416Z compiled=False, 2025-05-07T20:31:53.7327626Z ) 2025-05-07T20:31:53.7327942Z self = 2025-05-07T20:31:53.7328445Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:53.7328715Z 2025-05-07T20:31:53.7328801Z @given( 2025-05-07T20:31:53.7329029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.7329352Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.7329662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.7329986Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.7330315Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.7330695Z ) 2025-05-07T20:31:53.7331039Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.7331485Z def test_silu_mul_quant( 2025-05-07T20:31:53.7331730Z self, 2025-05-07T20:31:53.7331929Z T: int, 2025-05-07T20:31:53.7332124Z D: int, 2025-05-07T20:31:53.7332353Z scale_ub: Optional[float], 2025-05-07T20:31:53.7332627Z contiguous: bool, 2025-05-07T20:31:53.7332865Z compiled: bool, 2025-05-07T20:31:53.7333090Z ) -> None: 2025-05-07T20:31:53.7333309Z torch.manual_seed(2025) 2025-05-07T20:31:53.7333546Z 2025-05-07T20:31:53.7333880Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.7334302Z 2025-05-07T20:31:53.7334498Z x_sign = torch.sign(x) 2025-05-07T20:31:53.7334787Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.7335098Z x = x_sign * x_clamp 2025-05-07T20:31:53.7335341Z x0 = x[:, :D] 2025-05-07T20:31:53.7335563Z x1 = x[:, D:] 2025-05-07T20:31:53.7335774Z 2025-05-07T20:31:53.7335956Z if contiguous: 2025-05-07T20:31:53.7336190Z x0 = x0.contiguous() 2025-05-07T20:31:53.7336453Z x1 = x1.contiguous() 2025-05-07T20:31:53.7336696Z 2025-05-07T20:31:53.7336886Z if scale_ub is not None: 2025-05-07T20:31:53.7337163Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.7337504Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.7337809Z ) 2025-05-07T20:31:53.7338011Z else: 2025-05-07T20:31:53.7338225Z scale_ub_tensor = None 2025-05-07T20:31:53.7338474Z 2025-05-07T20:31:53.7338718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.7339041Z op = silu_mul_quant 2025-05-07T20:31:53.7339287Z if compiled: 
2025-05-07T20:31:53.7339538Z op = torch.compile(op) 2025-05-07T20:31:53.7339846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.7340120Z 2025-05-07T20:31:53.7340318Z > y_fp8, y_scale = fn() 2025-05-07T20:31:53.7340481Z 2025-05-07T20:31:53.7340588Z moe/activation_test.py:117: 2025-05-07T20:31:53.7340888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.7341218Z moe/activation_test.py:115: in fn 2025-05-07T20:31:53.7341502Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.7342190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:53.7342875Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:53.7343419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.7344103Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.7344769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.7345303Z kernel = self.compile( 2025-05-07T20:31:53.7345844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.7346501Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.7346895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.7347131Z 2025-05-07T20:31:53.7347339Z self = 2025-05-07T20:31:53.7348412Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.7349767Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a659c360>} 2025-05-07T20:31:53.7351177Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.7352176Z context = 2025-05-07T20:31:53.7352467Z 2025-05-07T20:31:53.7352633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.7353151Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.7353665Z module_map=module_map) 2025-05-07T20:31:53.7354097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.7354452Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.7354713Z E ^ 2025-05-07T20:31:53.7355170Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.7355623Z 2025-05-07T20:31:53.7356032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.7356545Z 2025-05-07T20:31:53.7356649Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.7357059Z self=, 2025-05-07T20:31:53.7357455Z T=1, 2025-05-07T20:31:53.7357644Z D=5120, 2025-05-07T20:31:53.7357841Z scale_ub=None, 2025-05-07T20:31:53.7358055Z contiguous=True, 2025-05-07T20:31:53.7358284Z compiled=True, 2025-05-07T20:31:53.7358488Z ) 2025-05-07T20:31:53.7358805Z self = 2025-05-07T20:31:53.7359287Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:53.7359543Z 2025-05-07T20:31:53.7359628Z @given( 2025-05-07T20:31:53.7359856Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.7360178Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.7360489Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.7360822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.7361143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.7361432Z ) 2025-05-07T20:31:53.7361782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.7362220Z def test_silu_mul_quant( 2025-05-07T20:31:53.7362465Z self, 2025-05-07T20:31:53.7362663Z T: int, 2025-05-07T20:31:53.7362858Z D: int, 2025-05-07T20:31:53.7363084Z scale_ub: Optional[float], 2025-05-07T20:31:53.7363391Z contiguous: bool, 2025-05-07T20:31:53.7363653Z compiled: bool, 2025-05-07T20:31:53.7363878Z ) -> None: 2025-05-07T20:31:53.7364101Z torch.manual_seed(2025) 2025-05-07T20:31:53.7364340Z 2025-05-07T20:31:53.7364619Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.7364964Z 2025-05-07T20:31:53.7365164Z x_sign = torch.sign(x) 2025-05-07T20:31:53.7365453Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.7367243Z x = x_sign * x_clamp 2025-05-07T20:31:53.7367484Z x0 = x[:, :D] 2025-05-07T20:31:53.7367701Z x1 = x[:, D:] 2025-05-07T20:31:53.7367912Z 2025-05-07T20:31:53.7368103Z if contiguous: 2025-05-07T20:31:53.7368330Z x0 = x0.contiguous() 2025-05-07T20:31:53.7368592Z x1 = x1.contiguous() 2025-05-07T20:31:53.7368838Z 2025-05-07T20:31:53.7369026Z if scale_ub is not None: 2025-05-07T20:31:53.7369312Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.7369656Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.7369964Z ) 2025-05-07T20:31:53.7370161Z else: 2025-05-07T20:31:53.7370376Z scale_ub_tensor = None 2025-05-07T20:31:53.7370710Z 2025-05-07T20:31:53.7370942Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.7371262Z op = silu_mul_quant 2025-05-07T20:31:53.7371518Z if compiled: 2025-05-07T20:31:53.7371761Z op = torch.compile(op) 2025-05-07T20:31:53.7372056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.7372334Z 2025-05-07T20:31:53.7372522Z y_fp8, y_scale = fn() 2025-05-07T20:31:53.7372829Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:53.7373120Z 2025-05-07T20:31:53.7373355Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.7373765Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:53.7374161Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:53.7374479Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:53.7374833Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.7375150Z 2025-05-07T20:31:53.7375351Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:53.7375542Z 2025-05-07T20:31:53.7375643Z moe/activation_test.py:126: 2025-05-07T20:31:53.7375944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.7376278Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:53.7376599Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.7377378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:53.7378138Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:53.7378682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.7379360Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.7380040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:53.7380754Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:53.7381470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:53.7382104Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:53.7382702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:53.7390263Z fn() 2025-05-07T20:31:53.7390810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:53.7391405Z self.fn.run( 2025-05-07T20:31:53.7391870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.7392406Z kernel = self.compile( 2025-05-07T20:31:53.7392946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.7393601Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.7393997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.7394232Z 2025-05-07T20:31:53.7394439Z self = 2025-05-07T20:31:53.7395510Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.7396869Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a659ef20>} 2025-05-07T20:31:53.7398423Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.7399610Z context = 2025-05-07T20:31:53.7399896Z 2025-05-07T20:31:53.7400064Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.7400578Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.7401032Z module_map=module_map) 2025-05-07T20:31:53.7401396Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.7401748Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:53.7402011Z E ^ 2025-05-07T20:31:53.7402583Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.7403035Z 2025-05-07T20:31:53.7403445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.9635724Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:53.9636797Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:53.9638125Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:53.9639535Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:53.9640512Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.9641808Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:53.9643162Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.9644500Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:53.9645849Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.9646892Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:31:53.9648140Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:53.9649367Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:53.9650216Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:53.9651398Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:53.9652745Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:53.9653933Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:53.9654933Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:31:53.9656253Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:53.9657515Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:53.9658408Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:53.9659477Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:53.9660506Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:53.9661269Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:53.9662422Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:53.9663761Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:53.9664809Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.9665712Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.9666456Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:53.9667465Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.0248353Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.0249410Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:54.0250720Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.0252112Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.0253084Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.0254461Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.0255964Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.0257248Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.0258718Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.0259756Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:31:54.0260997Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.0262222Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:54.0263051Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.0264236Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.0265418Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:54.0266436Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:54.0267434Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return 
visitor(node) 2025-05-07T20:31:54.0268627Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.0269882Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.0270763Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.0271831Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:54.0272850Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:54.0273610Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:54.0274761Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.0276081Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.0277210Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.0278102Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.0278831Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:54.0279830Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.2107701Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.2108757Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:54.2110086Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.2111484Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.2112459Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.2113741Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.2115102Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.2116380Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.2117730Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.2118769Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:31:54.2120003Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.2121231Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:54.2122073Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.2123258Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.2124504Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:54.2125521Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:54.2126652Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return 
visitor(node) 2025-05-07T20:31:54.2127854Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.2129119Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.2130087Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.2131156Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:54.2132190Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:54.2132954Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:54.2134247Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.2135587Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.2136636Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.2137549Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.2138287Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:54.2139294Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.2198599Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.2199642Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:54.2200969Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.2202376Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.2203346Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.2204682Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.2206042Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.2207466Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.2208815Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.2209848Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:31:54.2211197Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.2212421Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:54.2213271Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.2214577Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.2215770Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:54.2216794Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:54.2217803Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return 
visitor(node) 2025-05-07T20:31:54.2219008Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.2220271Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.2221168Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.2222241Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:54.2223274Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:54.2224098Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:54.2225258Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.2226600Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.2227653Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.2228564Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.2229303Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:54.2230398Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.8412816Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.8414533Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:54.8417475Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.8420268Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.8422202Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.8424163Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.8425519Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.8426805Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.8428163Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.8429195Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:31:54.8430435Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.8431658Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:54.8432494Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.8433689Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.8434881Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:54.8435904Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:54.8436914Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:31:54.8438112Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.8439524Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.8440418Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.8441490Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:54.8442640Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:54.8443408Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:54.8444611Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.8445950Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.8446992Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.8447894Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.8448641Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:54.8449649Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.9039548Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.9040589Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:54.9041901Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.9043296Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.9044264Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.9045549Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.9046905Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.9048183Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.9049529Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.9050705Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:31:54.9051945Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.9053169Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:54.9054275Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.9055462Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.9056662Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:54.9057685Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:54.9058689Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:31:54.9059892Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.9061150Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.9062047Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.9063119Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:54.9064145Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:54.9064908Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:54.9066067Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.9067403Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.9068446Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.9069345Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.9070080Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:54.9071091Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.0896995Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.0898385Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:55.0899706Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.0901103Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.0902194Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.0903473Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:55.0904838Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.0906131Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.0907487Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.0908523Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:31:55.0909773Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.0910994Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:55.0911837Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:55.0913037Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.0914233Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:55.0915268Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:55.0916276Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:31:55.0917481Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.0918751Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.0919646Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:55.0920848Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:55.0921872Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:55.0922637Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:55.0923890Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.0925229Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.0926280Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.0927185Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.0927924Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:55.0928930Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.0990630Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.0991664Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:55.0992980Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.0994425Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.0995396Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.0996681Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:55.0998047Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.0999540Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.1000890Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.1001928Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:31:55.1003169Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.1004537Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:55.1005372Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:55.1006555Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.1007852Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:55.1008878Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:55.1009889Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:31:55.1011090Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.1012355Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.1013258Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:55.1014491Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:55.1015516Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:55.1016281Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:55.1017433Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.1018775Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.1019824Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.1020724Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.1021465Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:55.1022474Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.2863331Z 2025-05-07T20:31:55.2863484Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.2863917Z self=, 2025-05-07T20:31:55.2864372Z T=2048, 2025-05-07T20:31:55.2864587Z D=5120, 2025-05-07T20:31:55.2864790Z scale_ub=None, 2025-05-07T20:31:55.2865013Z contiguous=True, 2025-05-07T20:31:55.2865238Z compiled=True, 2025-05-07T20:31:55.2865448Z ) 2025-05-07T20:31:55.2865946Z self = 2025-05-07T20:31:55.2866433Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:55.2866706Z 2025-05-07T20:31:55.2866788Z @given( 2025-05-07T20:31:55.2867029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.2867343Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.2867652Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.2867985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.2868310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.2868602Z ) 2025-05-07T20:31:55.2869070Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.2869522Z def test_silu_mul_quant( 2025-05-07T20:31:55.2869766Z self, 2025-05-07T20:31:55.2869972Z T: int, 2025-05-07T20:31:55.2870174Z D: int, 2025-05-07T20:31:55.2870398Z scale_ub: Optional[float], 2025-05-07T20:31:55.2870676Z contiguous: bool, 2025-05-07T20:31:55.2870921Z compiled: bool, 2025-05-07T20:31:55.2871145Z ) -> None: 2025-05-07T20:31:55.2871368Z torch.manual_seed(2025) 2025-05-07T20:31:55.2871613Z 2025-05-07T20:31:55.2871884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.2872232Z 2025-05-07T20:31:55.2872430Z x_sign = torch.sign(x) 2025-05-07T20:31:55.2872719Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.2873033Z x = x_sign * x_clamp 2025-05-07T20:31:55.2873278Z x0 = x[:, :D] 2025-05-07T20:31:55.2873495Z x1 = x[:, D:] 2025-05-07T20:31:55.2873707Z 2025-05-07T20:31:55.2873906Z if contiguous: 2025-05-07T20:31:55.2874139Z x0 = x0.contiguous() 2025-05-07T20:31:55.2874400Z x1 = x1.contiguous() 2025-05-07T20:31:55.2874645Z 2025-05-07T20:31:55.2874843Z if scale_ub is not None: 2025-05-07T20:31:55.2875118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.2875456Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.2875775Z ) 2025-05-07T20:31:55.2875969Z else: 2025-05-07T20:31:55.2876187Z scale_ub_tensor = None 2025-05-07T20:31:55.2876446Z 2025-05-07T20:31:55.2876678Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.2877000Z op = silu_mul_quant 2025-05-07T20:31:55.2877256Z if compiled: 2025-05-07T20:31:55.2877501Z op = torch.compile(op) 2025-05-07T20:31:55.2877803Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.2878089Z 2025-05-07T20:31:55.2878281Z y_fp8, y_scale = fn() 2025-05-07T20:31:55.2878575Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:55.2878875Z 2025-05-07T20:31:55.2879115Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.2879458Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:55.2879754Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:55.2880075Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:55.2880429Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.2880745Z 2025-05-07T20:31:55.2880952Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:55.2881145Z 2025-05-07T20:31:55.2881249Z moe/activation_test.py:126: 2025-05-07T20:31:55.2881547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.2881886Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:55.2882216Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.2882999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:55.2883749Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:55.2884388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.2885064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.2885759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:55.2886482Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:55.2887202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:55.2887833Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:55.2888504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:55.2889024Z fn() 2025-05-07T20:31:55.2889528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:55.2890109Z self.fn.run( 2025-05-07T20:31:55.2890579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.2891111Z kernel = self.compile( 2025-05-07T20:31:55.2891646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.2892296Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.2892695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.2892926Z 2025-05-07T20:31:55.2893140Z self = 2025-05-07T20:31:55.2894378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.2895735Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a651ab60>} 2025-05-07T20:31:55.2897062Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.2904415Z context = 2025-05-07T20:31:55.2904742Z 2025-05-07T20:31:55.2904914Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.2905437Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.2905894Z module_map=module_map) 2025-05-07T20:31:55.2906256Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.2906608Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:55.2906875Z E ^ 2025-05-07T20:31:55.2907330Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.2907778Z 2025-05-07T20:31:55.2908193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.2908698Z 2025-05-07T20:31:55.2908813Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.2909218Z self=, 2025-05-07T20:31:55.2909616Z T=128, 2025-05-07T20:31:55.2909833Z D=5120, 2025-05-07T20:31:55.2910028Z scale_ub=None, 2025-05-07T20:31:55.2910239Z contiguous=True, 2025-05-07T20:31:55.2910466Z compiled=True, 2025-05-07T20:31:55.2910671Z ) 2025-05-07T20:31:55.2910983Z self = 2025-05-07T20:31:55.2911463Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:55.2911893Z 2025-05-07T20:31:55.2911974Z @given( 2025-05-07T20:31:55.2912208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.2912513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.2912811Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.2913139Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.2913456Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.2913742Z ) 2025-05-07T20:31:55.2914090Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.2914526Z def test_silu_mul_quant( 2025-05-07T20:31:55.2914766Z self, 2025-05-07T20:31:55.2914960Z T: int, 2025-05-07T20:31:55.2915274Z D: int, 2025-05-07T20:31:55.2915490Z scale_ub: Optional[float], 2025-05-07T20:31:55.2915772Z contiguous: bool, 2025-05-07T20:31:55.2916015Z compiled: bool, 2025-05-07T20:31:55.2916242Z ) -> None: 2025-05-07T20:31:55.2916458Z torch.manual_seed(2025) 2025-05-07T20:31:55.2916704Z 2025-05-07T20:31:55.2916968Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.2917303Z 2025-05-07T20:31:55.2917495Z x_sign = torch.sign(x) 2025-05-07T20:31:55.2917776Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.2918083Z x = x_sign * x_clamp 2025-05-07T20:31:55.2918320Z x0 = x[:, :D] 2025-05-07T20:31:55.2918533Z x1 = x[:, D:] 2025-05-07T20:31:55.2918735Z 2025-05-07T20:31:55.2918920Z if contiguous: 2025-05-07T20:31:55.2919149Z x0 = x0.contiguous() 2025-05-07T20:31:55.2919406Z x1 = x1.contiguous() 2025-05-07T20:31:55.2919639Z 2025-05-07T20:31:55.2919825Z if scale_ub is not None: 2025-05-07T20:31:55.2920099Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.2920422Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.2920728Z ) 2025-05-07T20:31:55.2920919Z else: 2025-05-07T20:31:55.2921122Z scale_ub_tensor = None 2025-05-07T20:31:55.2921370Z 2025-05-07T20:31:55.2921597Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.2921903Z op = silu_mul_quant 2025-05-07T20:31:55.2922144Z if compiled: 2025-05-07T20:31:55.2922390Z op = torch.compile(op) 2025-05-07T20:31:55.2922679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.2922954Z 2025-05-07T20:31:55.2923144Z y_fp8, y_scale = fn() 2025-05-07T20:31:55.2923436Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:55.2923746Z 2025-05-07T20:31:55.2924008Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.2924337Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:55.2924618Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:55.2924925Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:55.2925276Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.2925578Z 2025-05-07T20:31:55.2925774Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:55.2925963Z 2025-05-07T20:31:55.2926067Z moe/activation_test.py:126: 2025-05-07T20:31:55.2926354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.2926678Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:55.2926995Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.2927765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:55.2928502Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:55.2929039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.2929707Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.2930471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:55.2931177Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:55.2931894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:55.2932522Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:55.2933115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:55.2933674Z fn() 2025-05-07T20:31:55.2934305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:55.2934879Z self.fn.run( 2025-05-07T20:31:55.2935332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.2935862Z kernel = self.compile( 2025-05-07T20:31:55.2936389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.2937031Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.2937420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.2937651Z 2025-05-07T20:31:55.2937859Z self = 2025-05-07T20:31:55.2938932Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.2940273Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a6c42700>} 2025-05-07T20:31:55.2941592Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.2942595Z context = 2025-05-07T20:31:55.2942882Z 2025-05-07T20:31:55.2943044Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.2943554Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.2944008Z module_map=module_map) 2025-05-07T20:31:55.2944377Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.2944732Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:55.2944993Z E ^ 2025-05-07T20:31:55.2945449Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.2945894Z 2025-05-07T20:31:55.2946301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.5234148Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.5235295Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:55.5236632Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.5238028Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.5239174Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.5240464Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:55.5241822Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.5243220Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.5244582Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.5245620Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:31:55.5246869Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.5248105Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:55.5248948Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:55.5250141Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.5251344Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:55.5252372Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:55.5253380Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:55.5254722Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.5255990Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.5256889Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:55.5257971Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:55.5259003Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:55.5259765Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:55.5260920Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.5262689Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.5263993Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.5265111Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.5266102Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:55.5267126Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
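Every failure above has the same root cause: the Triton kernel requests the fp8e4nv (FP8 E4M3) dtype, which Triton only provides on NVIDIA GPUs of compute capability 8.9 or newer, while the A10G in this g5.4xlarge runner is sm_86 and only exposes fp8e4b15 and fp8e5. Below is a minimal sketch of a capability guard that would skip these examples on older architectures; the helper name running_on_sm89_or_newer and the test class name are hypothetical, not part of the test suite above.

import unittest

import torch


def running_on_sm89_or_newer() -> bool:
    # fp8e4nv needs sm_89+ (Ada/Hopper); an A10G (sm_86) reports (8, 6) here.
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


# Example usage: skip fp8 E4M3 tests wholesale on unsupported hardware.
@unittest.skipUnless(running_on_sm89_or_newer(), "fp8e4nv needs sm_89+")
class TestFp8Kernels(unittest.TestCase):
    ...

Gating at the class level would also keep Hypothesis from re-triggering the same Triton compilation failure for every drawn example on hardware that can never pass.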
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.2391437Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:56.2393392Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:56.2395213Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:56.2396616Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:56.2397582Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:56.2399037Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:56.2400390Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.2401847Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:56.2403197Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.2404229Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] module_map=module_map) 2025-05-07T20:31:56.2405577Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:56.2406798Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:56.2407650Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:56.2408835Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:56.2410026Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:56.2411055Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:56.2412057Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return 
visitor(node) 2025-05-07T20:31:56.2413264Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:56.2414642Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:56.2415540Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:56.2416612Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:56.2417637Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:56.2418400Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:56.2419559Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:56.2420894Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:56.2421944Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.2422852Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.2423592Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:56.2424793Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
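For reference, the row-wise quantization that ref_fn reaches through triton_quantize_fp8_row can be approximated in pure PyTorch, sidestepping Triton's architecture check. This is a rough sketch written to match the dequantization used in the test (y = y_fp8.to(torch.float32) * y_scale[:, None]); the exact semantics of triton_quantize_fp8_row, including how scale_ub is applied, are assumptions here, and quantize_fp8_row_ref is a hypothetical name.

from typing import Optional, Tuple

import torch

# Largest finite value representable in float8_e4m3fn.
FP8_E4M3_MAX = 448.0


def quantize_fp8_row_ref(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute maximum, optionally bounded by scale_ub.
    row_max = x.abs().amax(dim=-1).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    # Guard all-zero rows against division by zero.
    scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
    # Quantize so that dequantization is q.float() * scale[:, None].
    x_fp8 = (x.float() / scale[:, None]).to(torch.float8_e4m3fn)
    return x_fp8, scale

The cast to torch.float8_e4m3fn is an ordinary dtype conversion in PyTorch, so this should run on sm_86 hardware where the Triton fp8e4nv path fails.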
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:56.7220849Z 
2025-05-07T20:31:56.7221209Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:56.7221680Z     self=,
2025-05-07T20:31:56.7222104Z     T=4096,
2025-05-07T20:31:56.7222305Z     D=5120,
2025-05-07T20:31:56.7230149Z     scale_ub=None,
2025-05-07T20:31:56.7230388Z     contiguous=True,
2025-05-07T20:31:56.7230600Z     compiled=True,
2025-05-07T20:31:56.7230795Z )
2025-05-07T20:31:56.7231144Z self = 
2025-05-07T20:31:56.7231630Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:56.7231891Z 
2025-05-07T20:31:56.7231967Z     @given(
2025-05-07T20:31:56.7232185Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:56.7232492Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:56.7232791Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:56.7233105Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:56.7233428Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:56.7233707Z     )
2025-05-07T20:31:56.7234044Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:56.7234476Z     def test_silu_mul_quant(
2025-05-07T20:31:56.7234710Z         self,
2025-05-07T20:31:56.7234897Z         T: int,
2025-05-07T20:31:56.7235075Z         D: int,
2025-05-07T20:31:56.7235283Z         scale_ub: Optional[float],
2025-05-07T20:31:56.7235552Z         contiguous: bool,
2025-05-07T20:31:56.7235777Z         compiled: bool,
2025-05-07T20:31:56.7235992Z     ) -> None:
2025-05-07T20:31:56.7236202Z         torch.manual_seed(2025)
2025-05-07T20:31:56.7236428Z 
2025-05-07T20:31:56.7236691Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:56.7237026Z 
2025-05-07T20:31:56.7237208Z         x_sign = torch.sign(x)
2025-05-07T20:31:56.7237489Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:56.7237789Z         x = x_sign * x_clamp
2025-05-07T20:31:56.7238031Z         x0 = x[:, :D]
2025-05-07T20:31:56.7238235Z         x1 = x[:, D:]
2025-05-07T20:31:56.7238431Z 
2025-05-07T20:31:56.7238606Z         if contiguous:
2025-05-07T20:31:56.7238824Z             x0 = x0.contiguous()
2025-05-07T20:31:56.7239263Z             x1 = x1.contiguous()
2025-05-07T20:31:56.7239495Z 
2025-05-07T20:31:56.7239677Z         if scale_ub is not None:
2025-05-07T20:31:56.7239943Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:56.7240267Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:56.7240563Z             )
2025-05-07T20:31:56.7240746Z         else:
2025-05-07T20:31:56.7240947Z             scale_ub_tensor = None
2025-05-07T20:31:56.7241184Z 
2025-05-07T20:31:56.7241409Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:56.7241713Z             op = silu_mul_quant
2025-05-07T20:31:56.7241952Z             if compiled:
2025-05-07T20:31:56.7242194Z                 op = torch.compile(op)
2025-05-07T20:31:56.7242611Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:56.7242881Z 
2025-05-07T20:31:56.7243065Z         y_fp8, y_scale = fn()
2025-05-07T20:31:56.7243333Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:56.7243624Z 
2025-05-07T20:31:56.7243852Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:56.7244171Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:56.7244451Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:56.7244754Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:56.7245103Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:56.7245400Z 
2025-05-07T20:31:56.7245592Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:56.7245778Z 
2025-05-07T20:31:56.7245878Z moe/activation_test.py:126: 
2025-05-07T20:31:56.7246157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:56.7246491Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:56.7246801Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:56.7247567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:56.7248305Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:56.7248834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:56.7249494Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:56.7250160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:56.7250861Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:56.7251567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:56.7252190Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:56.7252772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:56.7253280Z     fn()
2025-05-07T20:31:56.7253885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:56.7254442Z     self.fn.run(
2025-05-07T20:31:56.7254950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:56.7255467Z     kernel = self.compile(
2025-05-07T20:31:56.7255991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:56.7256621Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:56.7257007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:56.7257235Z 
2025-05-07T20:31:56.7257444Z self = 
2025-05-07T20:31:56.7258500Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:56.7259922Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a55a3600>}
2025-05-07T20:31:56.7261231Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:56.7262221Z context = 
2025-05-07T20:31:56.7262499Z 
2025-05-07T20:31:56.7262736Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:56.7263237Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:56.7263691Z                            module_map=module_map)
2025-05-07T20:31:56.7264050Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:56.7264398Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:56.7264668Z E       ^
2025-05-07T20:31:56.7265152Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:56.7265584Z 
2025-05-07T20:31:56.7265993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
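
Every failure in this stretch of the log is the same Triton error: the kernels request the fp8e4nv (e4m3) dtype, and the GPU in this run only exposes 'fp8e4b15' and 'fp8e5'. Triton generally gates fp8e4nv on NVIDIA compute capability 8.9 (Ada/Hopper) and newer, so the device here is most likely older than that. Below is a minimal sketch of a capability guard that would skip these tests instead of failing them; the helper name supports_fp8e4nv is hypothetical, while torch.cuda.get_device_capability and unittest.skipUnless are standard APIs.

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) conversions in Triton generally require sm_89 or newer.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test above:
    # @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...
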
2025-05-07T20:31:56.7266492Z 
2025-05-07T20:31:56.7266591Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:56.7282903Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:56.7283187Z moe/activation_test.py:126:
2025-05-07T20:31:56.7284923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:56.7285646Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:56.7301700Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:56.7302044Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:56.7302301Z E       ^
2025-05-07T20:31:56.7302748Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:56.7303182Z 
2025-05-07T20:31:56.7303585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:56.7518641Z W0507 20:31:56.750000 275344 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:56.7519861Z W0507 20:31:56.750000 275344 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:56.7521159Z W0507 20:31:56.750000 275344 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:56.7522134Z W0507 20:31:56.750000 275344 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:56.7523211Z W0507 20:31:56.750000 275344 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
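
The recompile-limit warning above is independent of the fp8 failures: torch.compile guards on tensor strides, and the test alternates between contiguous copies of x0 (row stride D = 5120) and non-contiguous views into the [T, 2 * D] buffer (row stride 2 * D = 10240), so each new layout recompiles silu_mul_quant until the limit of 8 is hit. A small sketch of the stride difference the guard reacts to, with toy sizes in place of the test's D = 5120:

    import torch

    D = 10
    x = torch.randn(4, 2 * D, dtype=torch.bfloat16)
    x0 = x[:, :D]
    print(x0.stride())               # (20, 1): view into the [T, 2*D] buffer, the "actual 10240" case
    print(x0.contiguous().stride())  # (10, 1): standalone [T, D] copy, the "expected 5120" case

As the warning itself suggests, TORCH_LOGS="recompiles" prints every recompilation reason, and torch._dynamo.config.recompile_limit can be raised when this churn is expected.
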
2025-05-07T20:31:56.9663260Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:56.9677519Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:56.9677777Z moe/activation_test.py:117:
2025-05-07T20:31:56.9680420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:56.9681087Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:56.9692089Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:56.9692437Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:56.9692693Z E       ^
2025-05-07T20:31:56.9693142Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:56.9694186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:56.9694846Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:56.9711748Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:56.9712155Z moe/activation_test.py:126:
2025-05-07T20:31:56.9713881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:56.9714666Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:56.9736287Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:56.9736643Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:56.9736909Z E       ^
2025-05-07T20:31:56.9737362Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:56.9738212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:57.1132953Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:57.1147251Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:57.1147685Z moe/activation_test.py:117:
2025-05-07T20:31:57.1149275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:57.1149956Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.1161006Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.1161353Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.1161608Z E       ^
2025-05-07T20:31:57.1162071Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.1162925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:57.1163535Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:57.1177874Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:57.1178135Z moe/activation_test.py:117:
2025-05-07T20:31:57.1180799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:57.1181475Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.1192594Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.1192941Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.1193198Z E       ^
2025-05-07T20:31:57.1193646Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.1194501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
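
For orientation between the repeated failures: the quantity under test is small. ref_fn computes silu(x0) * x1, i.e. x0 * sigmoid(x0) * x1, in fp32 and then row-wise FP8 quantization through triton_quantize_fp8_row. A rough pure-PyTorch restatement of that quantization step, assuming row-max scaling into the e4m3 range (448.0) and treating scale_ub as an upper bound on the per-row max; this illustrates the intent and is not FBGEMM's implementation:

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value in torch.float8_e4m3fn

    def rowwise_quantize_fp8_ref(y, scale_ub=None):
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=float(scale_ub))
        scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX  # per-row dequant scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    x0 = torch.randn(8, 16)
    x1 = torch.randn(8, 16)
    y = x0 * torch.sigmoid(x0) * x1                      # silu(x0) * x1, as in ref_fn
    y_fp8, y_scale = rowwise_quantize_fp8_ref(y)
    y_back = y_fp8.to(torch.float32) * y_scale[:, None]  # dequantize as the test does

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None], exactly as the test body does, recovers y up to fp8 rounding error.
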
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.2782834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.2783497Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.2784155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.2784674Z kernel = self.compile( 2025-05-07T20:31:57.2785203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.2785852Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.2786247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.2786467Z 2025-05-07T20:31:57.2786670Z self = 2025-05-07T20:31:57.2787719Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.2789066Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a5c30220>} 2025-05-07T20:31:57.2790380Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.2791375Z context = 2025-05-07T20:31:57.2791657Z 2025-05-07T20:31:57.2791818Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.2792327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.2792790Z module_map=module_map) 2025-05-07T20:31:57.2793142Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.2793486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.2793741Z E ^ 2025-05-07T20:31:57.2794186Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.2794683Z 2025-05-07T20:31:57.2795089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.2795594Z 2025-05-07T20:31:57.2795787Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.2796191Z self=, 2025-05-07T20:31:57.2796573Z T=128, 2025-05-07T20:31:57.2796753Z D=5120, 2025-05-07T20:31:57.2796945Z scale_ub=None, 2025-05-07T20:31:57.2797148Z contiguous=False, 2025-05-07T20:31:57.2797365Z compiled=False, 2025-05-07T20:31:57.2797562Z ) 2025-05-07T20:31:57.2797865Z self = 2025-05-07T20:31:57.2798526Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:57.2798795Z 2025-05-07T20:31:57.2798869Z @given( 2025-05-07T20:31:57.2799099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.2799527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.2799828Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.2800151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.2800470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.2800750Z ) 2025-05-07T20:31:57.2801092Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.2801521Z def test_silu_mul_quant( 2025-05-07T20:31:57.2801755Z self, 2025-05-07T20:31:57.2801944Z T: int, 2025-05-07T20:31:57.2802136Z D: int, 2025-05-07T20:31:57.2802344Z scale_ub: Optional[float], 2025-05-07T20:31:57.2802608Z contiguous: bool, 2025-05-07T20:31:57.2802838Z compiled: bool, 2025-05-07T20:31:57.2803049Z ) -> None: 2025-05-07T20:31:57.2803261Z torch.manual_seed(2025) 2025-05-07T20:31:57.2803492Z 2025-05-07T20:31:57.2803755Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.2804087Z 2025-05-07T20:31:57.2804275Z x_sign = torch.sign(x) 2025-05-07T20:31:57.2804555Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.2804859Z x = x_sign * x_clamp 2025-05-07T20:31:57.2805098Z x0 = x[:, :D] 2025-05-07T20:31:57.2805298Z x1 = x[:, D:] 2025-05-07T20:31:57.2805503Z 2025-05-07T20:31:57.2805679Z if contiguous: 2025-05-07T20:31:57.2805894Z x0 = x0.contiguous() 2025-05-07T20:31:57.2806143Z x1 = x1.contiguous() 2025-05-07T20:31:57.2806380Z 2025-05-07T20:31:57.2806560Z if scale_ub is not None: 2025-05-07T20:31:57.2806831Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.2807157Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.2807453Z ) 2025-05-07T20:31:57.2807635Z else: 2025-05-07T20:31:57.2807837Z scale_ub_tensor = None 2025-05-07T20:31:57.2808083Z 2025-05-07T20:31:57.2808306Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.2808613Z op = silu_mul_quant 2025-05-07T20:31:57.2808862Z if compiled: 2025-05-07T20:31:57.2809096Z op = torch.compile(op) 2025-05-07T20:31:57.2809392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.2809661Z 2025-05-07T20:31:57.2809842Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.2810004Z 2025-05-07T20:31:57.2810098Z moe/activation_test.py:117: 2025-05-07T20:31:57.2810383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.2810707Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.2810980Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.2811655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.2812328Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.2812856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.2813523Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.2814445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.2814986Z kernel = self.compile( 2025-05-07T20:31:57.2815511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.2816156Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.2816545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.2816767Z 2025-05-07T20:31:57.2816967Z self = 2025-05-07T20:31:57.2818090Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.2819427Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4a4c900>} 2025-05-07T20:31:57.2820746Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.2821738Z context = 2025-05-07T20:31:57.2822017Z 2025-05-07T20:31:57.2822178Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.2822685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.2823142Z module_map=module_map) 2025-05-07T20:31:57.2823500Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.2823842Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.2824097Z E ^ 2025-05-07T20:31:57.2824592Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.2825035Z 2025-05-07T20:31:57.2825439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.2825944Z 2025-05-07T20:31:57.2826045Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.2826453Z self=, 2025-05-07T20:31:57.2826847Z T=128, 2025-05-07T20:31:57.2827027Z D=5120, 2025-05-07T20:31:57.2827216Z scale_ub=1200.0, 2025-05-07T20:31:57.2827434Z contiguous=True, 2025-05-07T20:31:57.2827644Z compiled=False, 2025-05-07T20:31:57.2827841Z ) 2025-05-07T20:31:57.2828154Z self = 2025-05-07T20:31:57.2828626Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.2828897Z 2025-05-07T20:31:57.2828972Z @given( 2025-05-07T20:31:57.2829199Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.2829496Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.2829796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.2830119Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.2830439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.2830714Z ) 2025-05-07T20:31:57.2831052Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.2831483Z def test_silu_mul_quant( 2025-05-07T20:31:57.2831707Z self, 2025-05-07T20:31:57.2831895Z T: int, 2025-05-07T20:31:57.2832083Z D: int, 2025-05-07T20:31:57.2832294Z scale_ub: Optional[float], 2025-05-07T20:31:57.2832563Z contiguous: bool, 2025-05-07T20:31:57.2832795Z compiled: bool, 2025-05-07T20:31:57.2833007Z ) -> None: 2025-05-07T20:31:57.2833218Z torch.manual_seed(2025) 2025-05-07T20:31:57.2833621Z 2025-05-07T20:31:57.2833881Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.2834217Z 2025-05-07T20:31:57.2834409Z x_sign = torch.sign(x) 2025-05-07T20:31:57.2834689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.2834990Z x = x_sign * x_clamp 2025-05-07T20:31:57.2835226Z x0 = x[:, :D] 2025-05-07T20:31:57.2835434Z x1 = x[:, D:] 2025-05-07T20:31:57.2835631Z 2025-05-07T20:31:57.2835814Z if contiguous: 2025-05-07T20:31:57.2836039Z x0 = x0.contiguous() 2025-05-07T20:31:57.2836285Z x1 = x1.contiguous() 2025-05-07T20:31:57.2836521Z 2025-05-07T20:31:57.2836708Z if scale_ub is not None: 2025-05-07T20:31:57.2837050Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.2837386Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.2843582Z ) 2025-05-07T20:31:57.2843797Z else: 2025-05-07T20:31:57.2844018Z scale_ub_tensor = None 2025-05-07T20:31:57.2844271Z 2025-05-07T20:31:57.2844505Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.2844823Z op = silu_mul_quant 2025-05-07T20:31:57.2845068Z if compiled: 2025-05-07T20:31:57.2845312Z op = torch.compile(op) 2025-05-07T20:31:57.2845612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.2845881Z 2025-05-07T20:31:57.2846082Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.2846248Z 2025-05-07T20:31:57.2846358Z moe/activation_test.py:117: 2025-05-07T20:31:57.2846649Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.2846991Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.2847270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.2847953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.2848633Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.2849166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.2849837Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.2850489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.2851016Z kernel = self.compile( 2025-05-07T20:31:57.2851548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.2852193Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.2852591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.2852819Z 2025-05-07T20:31:57.2853023Z self = 2025-05-07T20:31:57.2854211Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.2855585Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4c39ee0>} 2025-05-07T20:31:57.2856898Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.2857900Z context = 2025-05-07T20:31:57.2858196Z 2025-05-07T20:31:57.2858359Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.2858869Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.2859451Z module_map=module_map) 2025-05-07T20:31:57.2859810Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.2860163Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.2860419Z E ^ 2025-05-07T20:31:57.2860870Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.2861312Z 2025-05-07T20:31:57.2861725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.4388747Z 2025-05-07T20:31:57.4389081Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.4389722Z self=, 2025-05-07T20:31:57.4390131Z T=1, 2025-05-07T20:31:57.4390323Z D=7168, 2025-05-07T20:31:57.4390519Z scale_ub=1200.0, 2025-05-07T20:31:57.4390737Z contiguous=True, 2025-05-07T20:31:57.4390964Z compiled=True, 2025-05-07T20:31:57.4391168Z ) 2025-05-07T20:31:57.4391481Z self = 2025-05-07T20:31:57.4391964Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.4392221Z 2025-05-07T20:31:57.4392309Z @given( 2025-05-07T20:31:57.4392538Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.4392844Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.4393143Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.4393469Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.4393784Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.4394066Z ) 2025-05-07T20:31:57.4394420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.4394851Z def test_silu_mul_quant( 2025-05-07T20:31:57.4395085Z self, 2025-05-07T20:31:57.4395276Z T: int, 2025-05-07T20:31:57.4395472Z D: int, 2025-05-07T20:31:57.4395688Z scale_ub: Optional[float], 2025-05-07T20:31:57.4395954Z contiguous: bool, 2025-05-07T20:31:57.4396184Z compiled: bool, 2025-05-07T20:31:57.4396403Z ) -> None: 2025-05-07T20:31:57.4396617Z torch.manual_seed(2025) 2025-05-07T20:31:57.4396852Z 2025-05-07T20:31:57.4397149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.4397479Z 2025-05-07T20:31:57.4397670Z x_sign = torch.sign(x) 2025-05-07T20:31:57.4397955Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.4398424Z x = x_sign * x_clamp 2025-05-07T20:31:57.4398665Z x0 = x[:, :D] 2025-05-07T20:31:57.4398885Z x1 = x[:, D:] 2025-05-07T20:31:57.4399088Z 2025-05-07T20:31:57.4399268Z if contiguous: 2025-05-07T20:31:57.4399493Z x0 = x0.contiguous() 2025-05-07T20:31:57.4399746Z x1 = x1.contiguous() 2025-05-07T20:31:57.4399985Z 2025-05-07T20:31:57.4400175Z if scale_ub is not None: 2025-05-07T20:31:57.4400442Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.4400765Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.4401067Z ) 2025-05-07T20:31:57.4401257Z else: 2025-05-07T20:31:57.4401463Z scale_ub_tensor = None 2025-05-07T20:31:57.4401715Z 2025-05-07T20:31:57.4401947Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.4402251Z op = silu_mul_quant 2025-05-07T20:31:57.4402497Z if compiled: 2025-05-07T20:31:57.4402742Z op = torch.compile(op) 2025-05-07T20:31:57.4403032Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.4403313Z 2025-05-07T20:31:57.4403509Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.4403670Z 2025-05-07T20:31:57.4403773Z moe/activation_test.py:117: 2025-05-07T20:31:57.4404060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.4404529Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.4404811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.4405362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.4405916Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.4406558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.4407230Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.4407756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.4408527Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.4409180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.4409701Z kernel = self.compile( 2025-05-07T20:31:57.4410229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.4410875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.4411270Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.4411492Z 2025-05-07T20:31:57.4411694Z self = 2025-05-07T20:31:57.4412761Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.4414200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4c3a660>} 2025-05-07T20:31:57.4415517Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.4416513Z context = 2025-05-07T20:31:57.4416793Z 2025-05-07T20:31:57.4416958Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.4417466Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.4417923Z module_map=module_map) 2025-05-07T20:31:57.4418276Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.4418633Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.4418888Z E ^ 2025-05-07T20:31:57.4419335Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:57.4420189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:57.4420795Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[test source identical to the example above; fails at `y_fp8, y_scale = fn()` (moe/activation_test.py:117) with the identical traceback through torch/_dynamo/eval_frame.py:678 into `_fbgemm_silu_mul_quant` (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) and the identical CompilationError raised from triton/compiler/compiler.py:100: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
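Why every example fails: Triton's fp8e4nv type corresponds to float8_e4m3fn, and this job ran on a g5.4xlarge runner (NVIDIA A10G, compute capability 8.6). As the ValueError itself states, this Triton build only offers fp8e4b15 and fp8e5 on this architecture; fp8e4nv kernels compile only on SM 8.9+ GPUs (L4/L40S, H100, and newer). A minimal sketch of a capability guard that would skip these examples on pre-8.9 devices; the helper name and skip message are illustrative, not taken from the test file:

    import pytest
    import torch

    def require_fp8e4nv() -> None:
        # fp8e4nv (float8_e4m3fn) only lowers on SM 8.9+; the A10G on this
        # runner reports SM 8.6, so Triton rejects the kernel at compile time.
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) < (8, 9):
            pytest.skip(f"fp8e4nv requires SM >= 8.9, got SM {major}.{minor}")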
2025-05-07T20:31:57.4451793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:57.8772752Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[test source identical to the example above] With scale_ub=None this example gets past fn(); the failure moves into the fp32 reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:57.8815247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
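This example exposes the same root cause from a second path: the reference implementation also launches a Triton kernel, `_kernel_quantize_fp8_row` via triton_quantize_fp8_row, which likewise targets fp8e4nv, so even the eager reference cannot compile on this GPU. For orientation, a rough eager sketch of what a row-wise fp8 quantizer computes; the 448.0 maximum (float8_e4m3fn) and the exact clamping details are assumptions, not taken from fp8_gemm.py:

    import torch

    FP8_MAX = 448.0  # max finite value of torch.float8_e4m3fn (assumed target dtype)

    def quantize_fp8_row_ref(y, scale_ub=None):
        # One scale per row, so each row's max |value| maps onto the fp8 range;
        # scale_ub, when given, caps the row max like the test's scale_ub_tensor.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None] then recovers y up to fp8 rounding, which is exactly how the test reconstructs y from fn()'s outputs.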
2025-05-07T20:31:57.8815851Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:58.0239939Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:58.0270877Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:58.0302525Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:58.2177357Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:58.2208202Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:58.3786098Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:58.3819209Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
[each of these eight examples prints the identical test source and fails at `y_fp8, y_scale = fn()` (moe/activation_test.py:117) inside `_fbgemm_silu_mul_quant` (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80), the compiled=True runs additionally passing through torch/_dynamo/eval_frame.py:678, with the identical CompilationError raised from triton/compiler/compiler.py:100: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
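Note that the compiled=True and compiled=False examples above fail identically: torch.compile (the torch/_dynamo/eval_frame.py frame in the earlier tracebacks) only wraps the Python call, while the Triton kernel is still JIT-compiled at its first launch, which is where the ValueError surfaces on this GPU. A sketch of that call shape, mirroring the test's fn(); the names here are illustrative:

    import torch

    def run_op(op, *args, compiled: bool = False):
        # torch.compile does not pre-build the underlying Triton kernel;
        # _fbgemm_silu_mul_quant still compiles lazily on the first call,
        # so both the compiled and eager paths raise the same
        # CompilationError on an SM 8.6 device.
        if compiled:
            op = torch.compile(op)
        return op(*args)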
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:58.3849210Z 
2025-05-07T20:31:58.3849614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:58.5219770Z 
2025-05-07T20:31:58.5220095Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError (test body and traceback identical to the first failure above)
2025-05-07T20:31:58.5251038Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:58.5288795Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:31:58.7293209Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:31:58.7344426Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
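For orientation, the op under test fuses SiLU, an elementwise multiply, and fp8 quantization. Below is a minimal eager-mode sketch of the semantics the test appears to exercise; silu_mul_quant_ref, the amax-based per-tensor scaling, and the clamp epsilon are illustrative assumptions inferred from the test body, not FBGEMM's actual implementation:

    import torch
    import torch.nn.functional as F

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Compute SiLU(x0) * x1 in fp32, then quantize to fp8 E4M3
        # (Triton's fp8e4nv corresponds to torch.float8_e4m3fn).
        y = F.silu(x0.float()) * x1.float()
        amax = y.abs().amax().clamp(min=1e-12)  # avoid a zero scale
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub.float())  # cap the scale
        scale = amax / torch.finfo(torch.float8_e4m3fn).max  # E4M3 max = 448
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale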
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.7395053Z 2025-05-07T20:31:58.7395783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.8690994Z 2025-05-07T20:31:58.8692186Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.8693486Z self=, 2025-05-07T20:31:58.8694634Z T=1, 2025-05-07T20:31:58.8694840Z D=7168, 2025-05-07T20:31:58.8695039Z scale_ub=None, 2025-05-07T20:31:58.8695266Z contiguous=False, 2025-05-07T20:31:58.8695505Z compiled=False, 2025-05-07T20:31:58.8695717Z ) 2025-05-07T20:31:58.8696073Z self = 2025-05-07T20:31:58.8696570Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:58.8696834Z 2025-05-07T20:31:58.8696918Z @given( 2025-05-07T20:31:58.8697168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.8697487Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.8697797Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.8698123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.8698704Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.8698997Z ) 2025-05-07T20:31:58.8699345Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.8699794Z def test_silu_mul_quant( 2025-05-07T20:31:58.8700050Z self, 2025-05-07T20:31:58.8700252Z T: int, 2025-05-07T20:31:58.8700459Z D: int, 2025-05-07T20:31:58.8700692Z scale_ub: Optional[float], 2025-05-07T20:31:58.8700963Z contiguous: bool, 2025-05-07T20:31:58.8701209Z compiled: bool, 2025-05-07T20:31:58.8701446Z ) -> None: 2025-05-07T20:31:58.8701663Z torch.manual_seed(2025) 2025-05-07T20:31:58.8701918Z 2025-05-07T20:31:58.8702196Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.8702548Z 2025-05-07T20:31:58.8702745Z x_sign = torch.sign(x) 2025-05-07T20:31:58.8703045Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.8703366Z x = x_sign * x_clamp 2025-05-07T20:31:58.8703609Z x0 = x[:, :D] 2025-05-07T20:31:58.8703837Z x1 = x[:, D:] 2025-05-07T20:31:58.8704059Z 2025-05-07T20:31:58.8704249Z if contiguous: 2025-05-07T20:31:58.8704491Z x0 = x0.contiguous() 2025-05-07T20:31:58.8704754Z x1 = x1.contiguous() 2025-05-07T20:31:58.8704989Z 2025-05-07T20:31:58.8705190Z if scale_ub is not None: 2025-05-07T20:31:58.8705471Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.8705802Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.8706116Z ) 2025-05-07T20:31:58.8706322Z else: 2025-05-07T20:31:58.8706900Z scale_ub_tensor = None 2025-05-07T20:31:58.8707158Z 2025-05-07T20:31:58.8707398Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.8707712Z op = silu_mul_quant 2025-05-07T20:31:58.8707972Z if compiled: 2025-05-07T20:31:58.8708230Z op = torch.compile(op) 2025-05-07T20:31:58.8708540Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8708819Z 2025-05-07T20:31:58.8709025Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.8709190Z 2025-05-07T20:31:58.8709301Z moe/activation_test.py:117: 2025-05-07T20:31:58.8709597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8710088Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.8710380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8711069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.8711763Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.8712301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.8712982Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.8713633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.8714168Z kernel = self.compile( 2025-05-07T20:31:58.8714709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.8715368Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.8715767Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8716001Z 2025-05-07T20:31:58.8716208Z self = 2025-05-07T20:31:58.8717285Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.8721236Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397a149a0>} 2025-05-07T20:31:58.8722558Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.8723567Z context = 2025-05-07T20:31:58.8723858Z 2025-05-07T20:31:58.8724024Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.8724544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.8725013Z module_map=module_map) 2025-05-07T20:31:58.8725372Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.8725728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.8725990Z E ^ 2025-05-07T20:31:58.8726445Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.8726895Z 2025-05-07T20:31:58.8727303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.8727815Z 2025-05-07T20:31:58.8727920Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.8728344Z self=, 2025-05-07T20:31:58.8728739Z T=2048, 2025-05-07T20:31:58.8728937Z D=7168, 2025-05-07T20:31:58.8729137Z scale_ub=None, 2025-05-07T20:31:58.8729349Z contiguous=False, 2025-05-07T20:31:58.8729691Z compiled=True, 2025-05-07T20:31:58.8729900Z ) 2025-05-07T20:31:58.8730216Z self = 2025-05-07T20:31:58.8730709Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:58.8730977Z 2025-05-07T20:31:58.8731070Z @given( 2025-05-07T20:31:58.8731299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.8731617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.8731928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.8732260Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.8732580Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.8732870Z ) 2025-05-07T20:31:58.8733308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.8733839Z def test_silu_mul_quant( 2025-05-07T20:31:58.8734107Z self, 2025-05-07T20:31:58.8734300Z T: int, 2025-05-07T20:31:58.8734506Z D: int, 2025-05-07T20:31:58.8734727Z scale_ub: Optional[float], 2025-05-07T20:31:58.8744311Z contiguous: bool, 2025-05-07T20:31:58.8744728Z compiled: bool, 2025-05-07T20:31:58.8745000Z ) -> None: 2025-05-07T20:31:58.8745224Z torch.manual_seed(2025) 2025-05-07T20:31:58.8745485Z 2025-05-07T20:31:58.8745775Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.8746134Z 2025-05-07T20:31:58.8746334Z x_sign = torch.sign(x) 2025-05-07T20:31:58.8746638Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.8746960Z x = x_sign * x_clamp 2025-05-07T20:31:58.8747203Z x0 = x[:, :D] 2025-05-07T20:31:58.8747441Z x1 = x[:, D:] 2025-05-07T20:31:58.8747660Z 2025-05-07T20:31:58.8747850Z if contiguous: 2025-05-07T20:31:58.8748093Z x0 = x0.contiguous() 2025-05-07T20:31:58.8748364Z x1 = x1.contiguous() 2025-05-07T20:31:58.8748615Z 2025-05-07T20:31:58.8748820Z if scale_ub is not None: 2025-05-07T20:31:58.8749104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.8749449Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.8749778Z ) 2025-05-07T20:31:58.8749988Z else: 2025-05-07T20:31:58.8750207Z scale_ub_tensor = None 2025-05-07T20:31:58.8750480Z 2025-05-07T20:31:58.8750739Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.8751068Z op = silu_mul_quant 2025-05-07T20:31:58.8751322Z if compiled: 2025-05-07T20:31:58.8751581Z op = torch.compile(op) 2025-05-07T20:31:58.8751892Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8752177Z 2025-05-07T20:31:58.8752385Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.8752554Z 2025-05-07T20:31:58.8752664Z moe/activation_test.py:117: 2025-05-07T20:31:58.8752957Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8753302Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.8753585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8754140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.8754704Z return fn(*args, **kwargs) 
2025-05-07T20:31:58.8755363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.8756047Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.8756578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.8757258Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.8757918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.8758629Z kernel = self.compile( 2025-05-07T20:31:58.8759166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.8759822Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.8760227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8760463Z 2025-05-07T20:31:58.8760667Z self = 2025-05-07T20:31:58.8761861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.8763219Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397a16160>} 2025-05-07T20:31:58.8764550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.8765558Z context = 2025-05-07T20:31:58.8765847Z 2025-05-07T20:31:58.8766015Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.8766530Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.8766990Z module_map=module_map) 2025-05-07T20:31:58.8767355Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.8767720Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.8767977Z E ^ 2025-05-07T20:31:58.8768436Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.8768890Z 2025-05-07T20:31:58.8769300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.8769807Z 2025-05-07T20:31:58.8769918Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.8770324Z self=, 2025-05-07T20:31:58.8770729Z T=4096, 2025-05-07T20:31:58.8770921Z D=7168, 2025-05-07T20:31:58.8771115Z scale_ub=None, 2025-05-07T20:31:58.8771332Z contiguous=False, 2025-05-07T20:31:58.8771555Z compiled=True, 2025-05-07T20:31:59.1010739Z ) 2025-05-07T20:31:59.1011331Z self = 2025-05-07T20:31:59.1012073Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:59.1012454Z 2025-05-07T20:31:59.1012565Z @given( 2025-05-07T20:31:59.1012854Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.1013204Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.1013525Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.1013966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.1014299Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.1014615Z ) 2025-05-07T20:31:59.1015017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.1015535Z def test_silu_mul_quant( 2025-05-07T20:31:59.1015776Z self, 2025-05-07T20:31:59.1015978Z T: int, 2025-05-07T20:31:59.1016179Z D: int, 2025-05-07T20:31:59.1016398Z scale_ub: Optional[float], 2025-05-07T20:31:59.1016681Z contiguous: bool, 2025-05-07T20:31:59.1016926Z compiled: bool, 2025-05-07T20:31:59.1017157Z ) -> None: 2025-05-07T20:31:59.1017378Z torch.manual_seed(2025) 2025-05-07T20:31:59.1017623Z 2025-05-07T20:31:59.1017891Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.1018610Z 2025-05-07T20:31:59.1018806Z x_sign = torch.sign(x) 2025-05-07T20:31:59.1019097Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.1019409Z x = x_sign * x_clamp 2025-05-07T20:31:59.1019651Z x0 = x[:, :D] 2025-05-07T20:31:59.1019875Z x1 = x[:, D:] 2025-05-07T20:31:59.1020082Z 2025-05-07T20:31:59.1020272Z if contiguous: 2025-05-07T20:31:59.1020506Z x0 = x0.contiguous() 2025-05-07T20:31:59.1020762Z x1 = x1.contiguous() 2025-05-07T20:31:59.1021013Z 2025-05-07T20:31:59.1021213Z if scale_ub is not None: 2025-05-07T20:31:59.1021483Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.1021977Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.1022292Z ) 2025-05-07T20:31:59.1022485Z else: 2025-05-07T20:31:59.1022704Z scale_ub_tensor = None 2025-05-07T20:31:59.1022960Z 2025-05-07T20:31:59.1023186Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.1023512Z op = silu_mul_quant 2025-05-07T20:31:59.1023768Z if compiled: 2025-05-07T20:31:59.1024013Z op = torch.compile(op) 2025-05-07T20:31:59.1024317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.1024599Z 2025-05-07T20:31:59.1024822Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.1025009Z 2025-05-07T20:31:59.1025118Z moe/activation_test.py:117: 2025-05-07T20:31:59.1025420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.1025756Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.1026036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.1026606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:59.1027172Z return fn(*args, **kwargs) 
2025-05-07T20:31:59.1027820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:59.1028510Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.1029048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.1029725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.1030377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.1030921Z kernel = self.compile( 2025-05-07T20:31:59.1031464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.1032122Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.1032515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.1032745Z 2025-05-07T20:31:59.1032952Z self = 2025-05-07T20:31:59.1034022Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.1035390Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397a16e80>} 2025-05-07T20:31:59.1036705Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.1037720Z context = 2025-05-07T20:31:59.1038008Z 2025-05-07T20:31:59.1038173Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.1038693Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.1039239Z module_map=module_map) 2025-05-07T20:31:59.1039605Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.1039960Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.1040214Z E ^ 2025-05-07T20:31:59.1040669Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.1041113Z 2025-05-07T20:31:59.1041523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.1042024Z 2025-05-07T20:31:59.1042213Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.1042620Z self=, 2025-05-07T20:31:59.1043023Z T=16384, 2025-05-07T20:31:59.1043216Z D=5120, 2025-05-07T20:31:59.1043409Z scale_ub=1200.0, 2025-05-07T20:31:59.1043645Z contiguous=False, 2025-05-07T20:31:59.1043873Z compiled=False, 2025-05-07T20:31:59.1044080Z ) 2025-05-07T20:31:59.1044390Z self = 2025-05-07T20:31:59.1044889Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:59.1045162Z 2025-05-07T20:31:59.1045244Z @given( 2025-05-07T20:31:59.1045475Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.1045788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.1046098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.1046423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.1046759Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.1047046Z ) 2025-05-07T20:31:59.1047396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.1047828Z def test_silu_mul_quant( 2025-05-07T20:31:59.1048081Z self, 2025-05-07T20:31:59.1048279Z T: int, 2025-05-07T20:31:59.1048472Z D: int, 2025-05-07T20:31:59.1048696Z scale_ub: Optional[float], 2025-05-07T20:31:59.1048969Z contiguous: bool, 2025-05-07T20:31:59.1049200Z compiled: bool, 2025-05-07T20:31:59.1049424Z ) -> None: 2025-05-07T20:31:59.1049641Z torch.manual_seed(2025) 2025-05-07T20:31:59.1049879Z 2025-05-07T20:31:59.1050149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.1050497Z 2025-05-07T20:31:59.1050689Z x_sign = torch.sign(x) 2025-05-07T20:31:59.1050982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.1051291Z x = x_sign * x_clamp 2025-05-07T20:31:59.1051533Z x0 = x[:, :D] 2025-05-07T20:31:59.1051754Z x1 = x[:, D:] 2025-05-07T20:31:59.1051966Z 2025-05-07T20:31:59.1052148Z if contiguous: 2025-05-07T20:31:59.1052385Z x0 = x0.contiguous() 2025-05-07T20:31:59.1052645Z x1 = x1.contiguous() 2025-05-07T20:31:59.1052889Z 2025-05-07T20:31:59.1053079Z if scale_ub is not None: 2025-05-07T20:31:59.1053352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.1053778Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.1054079Z ) 2025-05-07T20:31:59.1054273Z else: 2025-05-07T20:31:59.1054486Z scale_ub_tensor = None 2025-05-07T20:31:59.1054729Z 2025-05-07T20:31:59.1055001Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.1055324Z op = silu_mul_quant 2025-05-07T20:31:59.1055568Z if compiled: 2025-05-07T20:31:59.1055816Z op = torch.compile(op) 2025-05-07T20:31:59.1056116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.1056388Z 2025-05-07T20:31:59.1056580Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.1056741Z 2025-05-07T20:31:59.1056846Z moe/activation_test.py:117: 2025-05-07T20:31:59.1057226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.1057553Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.1057833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.1058516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:59.1059187Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.1059742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.1060417Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.1061179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.1061713Z kernel = self.compile( 2025-05-07T20:31:59.1062242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.1062898Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.1063295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.1063520Z 2025-05-07T20:31:59.1063724Z self = 2025-05-07T20:31:59.1064807Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.1066185Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a43cc220>} 2025-05-07T20:31:59.1067506Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.1068517Z context = 2025-05-07T20:31:59.1068803Z 2025-05-07T20:31:59.1068967Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.1069483Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.1069951Z module_map=module_map) 2025-05-07T20:31:59.1070317Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.1070664Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.1070928Z E ^ 2025-05-07T20:31:59.1071392Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.1071830Z 2025-05-07T20:31:59.1072242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.1072758Z 2025-05-07T20:31:59.1072865Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.1073277Z self=, 2025-05-07T20:31:59.1073683Z T=16384, 2025-05-07T20:31:59.1073872Z D=5120, 2025-05-07T20:31:59.1074071Z scale_ub=1200.0, 2025-05-07T20:31:59.1074296Z contiguous=True, 2025-05-07T20:31:59.1074515Z compiled=True, 2025-05-07T20:31:59.1074722Z ) 2025-05-07T20:31:59.1075038Z self = 2025-05-07T20:31:59.1075520Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:59.1075794Z 2025-05-07T20:31:59.1075873Z @given( 2025-05-07T20:31:59.1076111Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.1076423Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.1076722Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.1077145Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.1077473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.1077754Z ) 2025-05-07T20:31:59.1078105Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.1078550Z def test_silu_mul_quant( 2025-05-07T20:31:59.1078788Z self, 2025-05-07T20:31:59.1078985Z T: int, 2025-05-07T20:31:59.1079185Z D: int, 2025-05-07T20:31:59.1079398Z scale_ub: Optional[float], 2025-05-07T20:31:59.1079670Z contiguous: bool, 2025-05-07T20:31:59.1079912Z compiled: bool, 2025-05-07T20:31:59.1080129Z ) -> None: 2025-05-07T20:31:59.1080350Z torch.manual_seed(2025) 2025-05-07T20:31:59.1080591Z 2025-05-07T20:31:59.1080943Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.1081285Z 2025-05-07T20:31:59.1081489Z x_sign = torch.sign(x) 2025-05-07T20:31:59.1081785Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.1082098Z x = x_sign * x_clamp 2025-05-07T20:31:59.1082343Z x0 = x[:, :D] 2025-05-07T20:31:59.1082564Z x1 = x[:, D:] 2025-05-07T20:31:59.1082767Z 2025-05-07T20:31:59.1082966Z if contiguous: 2025-05-07T20:31:59.1083198Z x0 = x0.contiguous() 2025-05-07T20:31:59.1083455Z x1 = x1.contiguous() 2025-05-07T20:31:59.1083706Z 2025-05-07T20:31:59.1083904Z if scale_ub is not None: 2025-05-07T20:31:59.1084176Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.1084512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.1084823Z ) 2025-05-07T20:31:59.1085018Z else: 2025-05-07T20:31:59.1085241Z scale_ub_tensor = None 2025-05-07T20:31:59.1085496Z 2025-05-07T20:31:59.1085728Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.1086047Z op = silu_mul_quant 2025-05-07T20:31:59.1086308Z if compiled: 2025-05-07T20:31:59.1086559Z op = torch.compile(op) 2025-05-07T20:31:59.1086851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.1087129Z 2025-05-07T20:31:59.1087327Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.1087489Z 2025-05-07T20:31:59.1087588Z moe/activation_test.py:117: 2025-05-07T20:31:59.1087888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.1088226Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.1088508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.1089063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:59.1089625Z return fn(*args, **kwargs) 
2025-05-07T20:31:59.1090278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:59.1090954Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.1091490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.1092164Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.1092812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.1093340Z kernel = self.compile( 2025-05-07T20:31:59.1093970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.1094621Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.1095015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.1095247Z 2025-05-07T20:31:59.1095453Z self = 2025-05-07T20:31:59.1096515Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.1097946Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a43cd4e0>} 2025-05-07T20:31:59.1099566Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.1100565Z context = 2025-05-07T20:31:59.1100853Z 2025-05-07T20:31:59.1101158Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.1101674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.1102142Z module_map=module_map) 2025-05-07T20:31:59.1102500Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.1102855Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.1103114Z E ^ 2025-05-07T20:31:59.1103563Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.1104009Z 2025-05-07T20:31:59.1104414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.2652582Z 2025-05-07T20:31:59.2653372Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.2654387Z self=, 2025-05-07T20:31:59.2654929Z T=16384, 2025-05-07T20:31:59.2655171Z D=5120, 2025-05-07T20:31:59.2655375Z scale_ub=None, 2025-05-07T20:31:59.2655597Z contiguous=False, 2025-05-07T20:31:59.2655828Z compiled=True, 2025-05-07T20:31:59.2656045Z ) 2025-05-07T20:31:59.2656372Z self = 2025-05-07T20:31:59.2656867Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:59.2657140Z 2025-05-07T20:31:59.2657220Z @given( 2025-05-07T20:31:59.2657460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.2657780Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.2658080Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.2658409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.2658739Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.2659026Z ) 2025-05-07T20:31:59.2659372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.2659810Z def test_silu_mul_quant( 2025-05-07T20:31:59.2660055Z self, 2025-05-07T20:31:59.2660245Z T: int, 2025-05-07T20:31:59.2660446Z D: int, 2025-05-07T20:31:59.2660668Z scale_ub: Optional[float], 2025-05-07T20:31:59.2660935Z contiguous: bool, 2025-05-07T20:31:59.2661173Z compiled: bool, 2025-05-07T20:31:59.2661399Z ) -> None: 2025-05-07T20:31:59.2661612Z torch.manual_seed(2025) 2025-05-07T20:31:59.2661856Z 2025-05-07T20:31:59.2662131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.2662467Z 2025-05-07T20:31:59.2662665Z x_sign = torch.sign(x) 2025-05-07T20:31:59.2662955Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.2663260Z x = x_sign * x_clamp 2025-05-07T20:31:59.2663505Z x0 = x[:, :D] 2025-05-07T20:31:59.2663722Z x1 = x[:, D:] 2025-05-07T20:31:59.2663928Z 2025-05-07T20:31:59.2664123Z if contiguous: 2025-05-07T20:31:59.2664356Z x0 = x0.contiguous() 2025-05-07T20:31:59.2664616Z x1 = x1.contiguous() 2025-05-07T20:31:59.2664852Z 2025-05-07T20:31:59.2665363Z if scale_ub is not None: 2025-05-07T20:31:59.2665654Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.2665983Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.2666292Z ) 2025-05-07T20:31:59.2666494Z else: 2025-05-07T20:31:59.2666702Z scale_ub_tensor = None 2025-05-07T20:31:59.2666955Z 2025-05-07T20:31:59.2667190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.2667498Z op = silu_mul_quant 2025-05-07T20:31:59.2667747Z if compiled: 2025-05-07T20:31:59.2667999Z op = torch.compile(op) 2025-05-07T20:31:59.2668302Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.2668579Z 2025-05-07T20:31:59.2668902Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.2669075Z 2025-05-07T20:31:59.2669174Z moe/activation_test.py:117: 2025-05-07T20:31:59.2669470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.2669799Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.2670083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.2670639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:59.2678666Z return fn(*args, **kwargs) 
2025-05-07T20:31:59.2679346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:59.2680031Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:59.2680571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:59.2681262Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:59.2681921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:59.2682450Z     kernel = self.compile(
2025-05-07T20:31:59.2683003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:59.2683662Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:59.2684057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:59.2684297Z
2025-05-07T20:31:59.2684505Z self =
2025-05-07T20:31:59.2685625Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:59.2686988Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a43ce2a0>}
2025-05-07T20:31:59.2688790Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:59.2689795Z context =
2025-05-07T20:31:59.2690085Z
2025-05-07T20:31:59.2690249Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:59.2690763Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:59.2691224Z                            module_map=module_map)
2025-05-07T20:31:59.2691580Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:59.2691934Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:59.2692195Z E       ^
2025-05-07T20:31:59.2692648Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:59.2693093Z
2025-05-07T20:31:59.2693499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:59.2694250Z
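The failure above is an architecture limitation rather than a bug in the kernel or the test: fp8e4nv is Triton's name for the float8_e4m3fn format, and Triton's NVIDIA backend only compiles it for GPUs of compute capability 8.9 or newer (Ada and Hopper). The A10G on this linux.g5.4xlarge runner reports capability 8.6, where only the fp8e4b15 and fp8e5 encodings are available, hence the ValueError. Below is a minimal sketch of a capability guard that would let such tests skip cleanly on pre-Ada runners; the helper name and decorator placement are illustrative assumptions, not FBGEMM code.

```python
import unittest

import torch


def device_supports_fp8_e4m3() -> bool:
    # Triton lowers torch.float8_e4m3fn to its "fp8e4nv" type, which the
    # NVIDIA backend accepts only on compute capability >= (8, 9).
    # The A10G (sm_86) driving this job therefore cannot compile the kernel.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


class SiluMulQuantGuardExample(unittest.TestCase):
    @unittest.skipUnless(device_supports_fp8_e4m3(), "FP8 e4m3 requires SM 8.9+ (Ada/Hopper)")
    def test_silu_mul_quant_guarded(self) -> None:
        pass  # the real test body from moe/activation_test.py would run here
```

With a guard like this, the job would report skips instead of the run of identical CompilationErrors that follows.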
[log condensed: Hypothesis went on to try ten more examples of test_silu_mul_quant, and every one failed with the identical CompilationError at triton/compiler/compiler.py:100, i.e. ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), with a source listing and traceback matching the example above (minus the torch/_dynamo/eval_frame.py frame for the compiled=False examples). The parameter sets tried were:
  T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True
  T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
  T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True
  T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False
  T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False
  T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True
  T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True
  T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True]
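For context on what these examples exercise: the test body splits one [T, 2*D] bfloat16 input into halves x0 and x1 and expects silu_mul_quant(x0, x1, scale_ub_tensor) to return a quantized tensor plus scales. The sketch below is a rough reference of the presumed semantics, SiLU gating followed by rowwise FP8 quantization with an optional upper bound on the row maximum; every detail here, including the scale convention, is inferred from the test rather than taken from FBGEMM's implementation.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,                      # [T, D], bfloat16
    x1: torch.Tensor,                      # [T, D], bfloat16
    scale_ub: torch.Tensor | None = None,  # optional [1] float32 clamp
) -> tuple[torch.Tensor, torch.Tensor]:
    # Compute y = SiLU(x0) * x1 in float32, then quantize each row to fp8
    # with a per-row dequantization scale.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)
```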
2025-05-07T20:32:00.0274550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.0275236Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.0275775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.0276444Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.0277115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.0277649Z kernel = self.compile( 2025-05-07T20:32:00.0278190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.0278844Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.0279244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0279474Z 2025-05-07T20:32:00.0279686Z self = 2025-05-07T20:32:00.0280755Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.0282120Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93975cd440>} 2025-05-07T20:32:00.0283447Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.0284543Z context = 2025-05-07T20:32:00.0284825Z 2025-05-07T20:32:00.0284999Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.0285513Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.0285985Z module_map=module_map) 2025-05-07T20:32:00.0286356Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.0286714Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.0286973Z E ^ 2025-05-07T20:32:00.0287514Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.0287961Z 2025-05-07T20:32:00.0288378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.0288891Z 2025-05-07T20:32:00.0288996Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.0289410Z self=, 2025-05-07T20:32:00.0289816Z T=16384, 2025-05-07T20:32:00.0290017Z D=5120, 2025-05-07T20:32:00.0290211Z scale_ub=None, 2025-05-07T20:32:00.0290434Z contiguous=False, 2025-05-07T20:32:00.0290673Z compiled=False, 2025-05-07T20:32:00.0290882Z ) 2025-05-07T20:32:00.0291203Z self = 2025-05-07T20:32:00.0291704Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:00.0291979Z 2025-05-07T20:32:00.0292068Z @given( 2025-05-07T20:32:00.0292304Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.0292620Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.0292924Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.0293263Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.0293592Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.0294000Z ) 2025-05-07T20:32:00.0294345Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.0294788Z def test_silu_mul_quant( 2025-05-07T20:32:00.0295034Z self, 2025-05-07T20:32:00.0295228Z T: int, 2025-05-07T20:32:00.0295430Z D: int, 2025-05-07T20:32:00.0295652Z scale_ub: Optional[float], 2025-05-07T20:32:00.0295921Z contiguous: bool, 2025-05-07T20:32:00.0296166Z compiled: bool, 2025-05-07T20:32:00.0296395Z ) -> None: 2025-05-07T20:32:00.0296607Z torch.manual_seed(2025) 2025-05-07T20:32:00.0296859Z 2025-05-07T20:32:00.0297134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.0297473Z 2025-05-07T20:32:00.0297672Z x_sign = torch.sign(x) 2025-05-07T20:32:00.0297969Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.0300193Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
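The CompilationError that keeps recurring above is architectural rather than a flaw in the test: Triton's fp8e4nv dtype is FP8 E4M3 for NVIDIA GPUs with compute capability 8.9 or newer (Ada and Hopper), and the GPU on this runner evidently reports a lower capability, so the backend exposes only fp8e4b15 and fp8e5 and rejects every compiled variant of _fbgemm_silu_mul_quant at make_ir time. A minimal capability guard along the following lines could turn these repeated failures into a single skip; the helper is illustrative and not part of the test file:

    import torch

    def supports_fp8e4nv() -> bool:
        """True when Triton can compile fp8e4nv (FP8 E4M3) kernels here."""
        if not torch.cuda.is_available():
            return False
        # fp8e4nv needs SM 8.9+ (Ada/Hopper); older parts only get
        # fp8e4b15 and fp8e5, exactly as the ValueError above reports.
        return torch.cuda.get_device_capability() >= (8, 9)

Wired in as unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 not supported on this GPU"), the Hypothesis examples would be skipped once per test instead of failing for every drawn parameter set.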
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.0302025Z 2025-05-07T20:32:00.0302142Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:00.0302357Z 2025-05-07T20:32:00.0302466Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.0302864Z self=, 2025-05-07T20:32:00.0303267Z T=4096, 2025-05-07T20:32:00.0303593Z D=7168, 2025-05-07T20:32:00.0303782Z scale_ub=1200.0, 2025-05-07T20:32:00.0304007Z contiguous=True, 2025-05-07T20:32:00.0304226Z compiled=True, 2025-05-07T20:32:00.0304424Z ) 2025-05-07T20:32:00.0304743Z self = 2025-05-07T20:32:00.0305233Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:00.0305500Z 2025-05-07T20:32:00.0305588Z @given( 2025-05-07T20:32:00.0305814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.0306125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.0306431Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.0306900Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.0307232Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.0307523Z ) 2025-05-07T20:32:00.0307862Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.0308311Z def test_silu_mul_quant( 2025-05-07T20:32:00.0308549Z self, 2025-05-07T20:32:00.0308735Z T: int, 2025-05-07T20:32:00.0308934Z D: int, 2025-05-07T20:32:00.0309154Z scale_ub: Optional[float], 2025-05-07T20:32:00.0309421Z contiguous: bool, 2025-05-07T20:32:00.0309655Z compiled: bool, 2025-05-07T20:32:00.0309878Z ) -> None: 2025-05-07T20:32:00.0310092Z torch.manual_seed(2025) 2025-05-07T20:32:00.0310322Z 2025-05-07T20:32:00.0310589Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.0310923Z 2025-05-07T20:32:00.0311111Z x_sign = torch.sign(x) 2025-05-07T20:32:00.0311404Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.0313355Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.0315170Z 2025-05-07T20:32:00.0315293Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:00.0315499Z 2025-05-07T20:32:00.0315608Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.0316020Z self=, 2025-05-07T20:32:00.0316416Z T=16384, 2025-05-07T20:32:00.0316606Z D=7168, 2025-05-07T20:32:00.0316799Z scale_ub=None, 2025-05-07T20:32:00.0317013Z contiguous=False, 2025-05-07T20:32:00.0317237Z compiled=False, 2025-05-07T20:32:00.0317431Z ) 2025-05-07T20:32:00.0317745Z self = 2025-05-07T20:32:00.0318236Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:00.0318505Z 2025-05-07T20:32:00.0318590Z @given( 2025-05-07T20:32:00.0318812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.0319122Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.0319427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.0319743Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.0320068Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.0320352Z ) 2025-05-07T20:32:00.0320689Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.0321130Z def test_silu_mul_quant( 2025-05-07T20:32:00.0321372Z self, 2025-05-07T20:32:00.0321567Z T: int, 2025-05-07T20:32:00.0321768Z D: int, 2025-05-07T20:32:00.0321985Z scale_ub: Optional[float], 2025-05-07T20:32:00.0322341Z contiguous: bool, 2025-05-07T20:32:00.0322580Z compiled: bool, 2025-05-07T20:32:00.0322804Z ) -> None: 2025-05-07T20:32:00.0323020Z torch.manual_seed(2025) 2025-05-07T20:32:00.0323254Z 2025-05-07T20:32:00.0323520Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.0325591Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.0327405Z 2025-05-07T20:32:00.0327528Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.1563453Z 2025-05-07T20:32:00.1564121Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.1564755Z self=, 2025-05-07T20:32:00.1565769Z T=2048, 2025-05-07T20:32:00.1566233Z D=7168, 2025-05-07T20:32:00.1566615Z scale_ub=1200.0, 2025-05-07T20:32:00.1567059Z contiguous=True, 2025-05-07T20:32:00.1567481Z compiled=True, 2025-05-07T20:32:00.1567893Z ) 2025-05-07T20:32:00.1568533Z self = 2025-05-07T20:32:00.1569517Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:00.1570066Z 2025-05-07T20:32:00.1570227Z @given( 2025-05-07T20:32:00.1570725Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.1571345Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.1571939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.1572604Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.1573254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.1574005Z ) 2025-05-07T20:32:00.1574692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.1575214Z def test_silu_mul_quant( 2025-05-07T20:32:00.1575446Z self, 2025-05-07T20:32:00.1575648Z T: int, 2025-05-07T20:32:00.1575851Z D: int, 2025-05-07T20:32:00.1576078Z scale_ub: Optional[float], 2025-05-07T20:32:00.1576353Z contiguous: bool, 2025-05-07T20:32:00.1576592Z compiled: bool, 2025-05-07T20:32:00.1576814Z ) -> None: 2025-05-07T20:32:00.1577033Z torch.manual_seed(2025) 2025-05-07T20:32:00.1577275Z 2025-05-07T20:32:00.1577548Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.1577887Z 2025-05-07T20:32:00.1578085Z x_sign = torch.sign(x) 2025-05-07T20:32:00.1578374Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.1580346Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.1582188Z 2025-05-07T20:32:00.1582315Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:00.1582524Z 2025-05-07T20:32:00.1582629Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.1583035Z self=, 2025-05-07T20:32:00.1583464Z T=2048, 2025-05-07T20:32:00.1583655Z D=7168, 2025-05-07T20:32:00.1584227Z scale_ub=None, 2025-05-07T20:32:00.1584439Z contiguous=True, 2025-05-07T20:32:00.1584659Z compiled=False, 2025-05-07T20:32:00.1584862Z ) 2025-05-07T20:32:00.1585174Z self = 2025-05-07T20:32:00.1585658Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.1585923Z 2025-05-07T20:32:00.1586007Z @given( 2025-05-07T20:32:00.1586234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.1586548Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.1586850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.1587178Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.1587654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.1587944Z ) 2025-05-07T20:32:00.1588292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.1588735Z def test_silu_mul_quant( 2025-05-07T20:32:00.1588981Z self, 2025-05-07T20:32:00.1589181Z T: int, 2025-05-07T20:32:00.1589373Z D: int, 2025-05-07T20:32:00.1589591Z scale_ub: Optional[float], 2025-05-07T20:32:00.1589864Z contiguous: bool, 2025-05-07T20:32:00.1590101Z compiled: bool, 2025-05-07T20:32:00.1590324Z ) -> None: 2025-05-07T20:32:00.1590543Z torch.manual_seed(2025) 2025-05-07T20:32:00.1590780Z 2025-05-07T20:32:00.1591047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.1591388Z 2025-05-07T20:32:00.1591585Z > x_sign = torch.sign(x) 2025-05-07T20:32:00.1593479Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
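The OutOfMemoryError cases are a separate issue: each drawn example allocates a fresh [T, 2 * D] bfloat16 input (448.00 MiB at T=16384, D=7168) plus several derived tensors, and by this point the allocator already reports roughly 21.9 GiB of the device's 22.07 GiB in use, so even 40-56 MiB requests fail. The error text's own suggestion is the cheapest mitigation to try; it must take effect before the process first touches CUDA, for example from the CI job environment or a conftest.py (the placement is an assumption, the variable itself is quoted from the log):

    import os

    # Reduce fragmentation in the caching allocator. This does not add
    # memory; it lets the allocator grow segments instead of stranding
    # "reserved but unallocated" blocks like those reported above.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")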
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.1595288Z 2025-05-07T20:32:00.1595403Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:00.1595620Z 2025-05-07T20:32:00.1595721Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.1596128Z self=, 2025-05-07T20:32:00.1596520Z T=1, 2025-05-07T20:32:00.1596706Z D=7168, 2025-05-07T20:32:00.1596900Z scale_ub=1200.0, 2025-05-07T20:32:00.1597121Z contiguous=True, 2025-05-07T20:32:00.1597343Z compiled=False, 2025-05-07T20:32:00.1597550Z ) 2025-05-07T20:32:00.1597863Z self = 2025-05-07T20:32:00.1598676Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.1598948Z 2025-05-07T20:32:00.1599028Z @given( 2025-05-07T20:32:00.1599257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.1599565Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.1599868Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.1600193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.1600511Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.1600797Z ) 2025-05-07T20:32:00.1601141Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.1601585Z def test_silu_mul_quant( 2025-05-07T20:32:00.1601821Z self, 2025-05-07T20:32:00.1602016Z T: int, 2025-05-07T20:32:00.1602215Z D: int, 2025-05-07T20:32:00.1602426Z scale_ub: Optional[float], 2025-05-07T20:32:00.1602691Z contiguous: bool, 2025-05-07T20:32:00.1602928Z compiled: bool, 2025-05-07T20:32:00.1603144Z ) -> None: 2025-05-07T20:32:00.1603500Z torch.manual_seed(2025) 2025-05-07T20:32:00.1603740Z 2025-05-07T20:32:00.1604001Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.1604344Z 2025-05-07T20:32:00.1604536Z x_sign = torch.sign(x) 2025-05-07T20:32:00.1604819Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.1605132Z x = x_sign * x_clamp 2025-05-07T20:32:00.1605373Z x0 = x[:, :D] 2025-05-07T20:32:00.1605585Z x1 = x[:, D:] 2025-05-07T20:32:00.1605797Z 2025-05-07T20:32:00.1605986Z if contiguous: 2025-05-07T20:32:00.1606213Z x0 = x0.contiguous() 2025-05-07T20:32:00.1606471Z x1 = x1.contiguous() 2025-05-07T20:32:00.1606711Z 2025-05-07T20:32:00.1607023Z if scale_ub is not None: 2025-05-07T20:32:00.1607294Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.1607625Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.1607938Z ) 2025-05-07T20:32:00.1608131Z else: 2025-05-07T20:32:00.1608345Z scale_ub_tensor = None 2025-05-07T20:32:00.1608598Z 2025-05-07T20:32:00.1608826Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.1609142Z op = silu_mul_quant 2025-05-07T20:32:00.1609391Z if compiled: 2025-05-07T20:32:00.1609636Z op = torch.compile(op) 2025-05-07T20:32:00.1609936Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.1610210Z 2025-05-07T20:32:00.1610396Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.1610564Z 2025-05-07T20:32:00.1610664Z moe/activation_test.py:117: 2025-05-07T20:32:00.1610964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.1611292Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.1611566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.1612253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.1612940Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.1613464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.1614263Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.1614918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.1615448Z kernel = self.compile( 2025-05-07T20:32:00.1615977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.1616630Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.1617028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.1617253Z 2025-05-07T20:32:00.1617462Z self = 2025-05-07T20:32:00.1618522Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.1619866Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93973f8400>} 2025-05-07T20:32:00.1621179Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.1622191Z context = 2025-05-07T20:32:00.1622473Z 2025-05-07T20:32:00.1622639Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.1623264Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.1623724Z module_map=module_map) 2025-05-07T20:32:00.1624086Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.1624428Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.1624688Z E ^ 2025-05-07T20:32:00.1625145Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.1625584Z 2025-05-07T20:32:00.1625988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.1626501Z 2025-05-07T20:32:00.1626683Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.1627104Z self=, 2025-05-07T20:32:00.1627503Z T=128, 2025-05-07T20:32:00.1627696Z D=5120, 2025-05-07T20:32:00.1627894Z scale_ub=None, 2025-05-07T20:32:00.1628101Z contiguous=True, 2025-05-07T20:32:00.1628326Z compiled=False, 2025-05-07T20:32:00.1628530Z ) 2025-05-07T20:32:00.1628843Z self = 2025-05-07T20:32:00.1629323Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.1629586Z 2025-05-07T20:32:00.1629677Z @given( 2025-05-07T20:32:00.1629906Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.1639799Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.1640131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.1640457Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.1640799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.1641090Z ) 2025-05-07T20:32:00.1641438Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.1641887Z def test_silu_mul_quant( 2025-05-07T20:32:00.1642140Z self, 2025-05-07T20:32:00.1642333Z T: int, 2025-05-07T20:32:00.1642534Z D: int, 2025-05-07T20:32:00.1642758Z scale_ub: Optional[float], 2025-05-07T20:32:00.1643027Z contiguous: bool, 2025-05-07T20:32:00.1643270Z compiled: bool, 2025-05-07T20:32:00.1643498Z ) -> None: 2025-05-07T20:32:00.1643710Z torch.manual_seed(2025) 2025-05-07T20:32:00.1643954Z 2025-05-07T20:32:00.1644233Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.1644579Z 2025-05-07T20:32:00.1644768Z x_sign = torch.sign(x) 2025-05-07T20:32:00.1645060Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.1645379Z x = x_sign * x_clamp 2025-05-07T20:32:00.1645617Z x0 = x[:, :D] 2025-05-07T20:32:00.1645838Z x1 = x[:, D:] 2025-05-07T20:32:00.1646050Z 2025-05-07T20:32:00.1646229Z if contiguous: 2025-05-07T20:32:00.1646459Z x0 = x0.contiguous() 2025-05-07T20:32:00.1646719Z x1 = x1.contiguous() 2025-05-07T20:32:00.1646950Z 2025-05-07T20:32:00.1647140Z if scale_ub is not None: 2025-05-07T20:32:00.1647402Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.1647723Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.1648031Z ) 2025-05-07T20:32:00.1648225Z else: 2025-05-07T20:32:00.1648428Z scale_ub_tensor = None 2025-05-07T20:32:00.1648681Z 2025-05-07T20:32:00.1648915Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.1649218Z op = silu_mul_quant 2025-05-07T20:32:00.1649466Z if compiled: 2025-05-07T20:32:00.1649709Z op = torch.compile(op) 2025-05-07T20:32:00.1650005Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.1650270Z 2025-05-07T20:32:00.1650454Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.1650612Z 2025-05-07T20:32:00.1650709Z moe/activation_test.py:117: 2025-05-07T20:32:00.1651119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.1651449Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.1651721Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.1652399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.1653075Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.1653596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.1654348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.1655083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.1655605Z kernel = self.compile( 2025-05-07T20:32:00.1656130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.1656788Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.1657184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.1657408Z 2025-05-07T20:32:00.1657621Z self = 2025-05-07T20:32:00.1658677Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.1660026Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93973f9300>} 2025-05-07T20:32:00.1661344Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.1662349Z context = 2025-05-07T20:32:00.1662626Z 2025-05-07T20:32:00.1662789Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.1663297Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.1663752Z module_map=module_map) 2025-05-07T20:32:00.1664110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.1664449Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.1664703Z E ^ 2025-05-07T20:32:00.1665159Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.1665593Z 2025-05-07T20:32:00.1666000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.2785477Z 2025-05-07T20:32:00.2786224Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.2786880Z self=, 2025-05-07T20:32:00.2787417Z T=128, 2025-05-07T20:32:00.2787605Z D=7168, 2025-05-07T20:32:00.2787800Z scale_ub=None, 2025-05-07T20:32:00.2788014Z contiguous=True, 2025-05-07T20:32:00.2788238Z compiled=False, 2025-05-07T20:32:00.2788438Z ) 2025-05-07T20:32:00.2788756Z self = 2025-05-07T20:32:00.2789243Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.2789505Z 2025-05-07T20:32:00.2789583Z @given( 2025-05-07T20:32:00.2789831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.2790143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.2790440Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.2791153Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.2791478Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.2791754Z ) 2025-05-07T20:32:00.2792099Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.2792540Z def test_silu_mul_quant( 2025-05-07T20:32:00.2792786Z self, 2025-05-07T20:32:00.2792980Z T: int, 2025-05-07T20:32:00.2793181Z D: int, 2025-05-07T20:32:00.2793401Z scale_ub: Optional[float], 2025-05-07T20:32:00.2793667Z contiguous: bool, 2025-05-07T20:32:00.2793905Z compiled: bool, 2025-05-07T20:32:00.2794136Z ) -> None: 2025-05-07T20:32:00.2794345Z torch.manual_seed(2025) 2025-05-07T20:32:00.2794586Z 2025-05-07T20:32:00.2795016Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.2795355Z 2025-05-07T20:32:00.2795549Z x_sign = torch.sign(x) 2025-05-07T20:32:00.2795839Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.2796149Z x = x_sign * x_clamp 2025-05-07T20:32:00.2796390Z x0 = x[:, :D] 2025-05-07T20:32:00.2796613Z x1 = x[:, D:] 2025-05-07T20:32:00.2796814Z 2025-05-07T20:32:00.2796999Z if contiguous: 2025-05-07T20:32:00.2797224Z x0 = x0.contiguous() 2025-05-07T20:32:00.2797479Z x1 = x1.contiguous() 2025-05-07T20:32:00.2797716Z 2025-05-07T20:32:00.2797908Z if scale_ub is not None: 2025-05-07T20:32:00.2798467Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.2798796Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.2799104Z ) 2025-05-07T20:32:00.2799295Z else: 2025-05-07T20:32:00.2799502Z scale_ub_tensor = None 2025-05-07T20:32:00.2799755Z 2025-05-07T20:32:00.2799982Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.2800290Z op = silu_mul_quant 2025-05-07T20:32:00.2800545Z if compiled: 2025-05-07T20:32:00.2800796Z op = torch.compile(op) 2025-05-07T20:32:00.2801089Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2801367Z 2025-05-07T20:32:00.2801562Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.2801723Z 2025-05-07T20:32:00.2801826Z moe/activation_test.py:117: 2025-05-07T20:32:00.2802116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2802447Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.2802726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2803401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.2804085Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.2804619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.2805345Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.2805997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.2806524Z kernel = self.compile( 2025-05-07T20:32:00.2807060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.2807705Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.2808099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2808330Z 2025-05-07T20:32:00.2808536Z self = 2025-05-07T20:32:00.2809608Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.2811216Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93973fa0c0>} 2025-05-07T20:32:00.2812529Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.2813538Z context = 2025-05-07T20:32:00.2813956Z 2025-05-07T20:32:00.2814120Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.2814756Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.2815214Z module_map=module_map) 2025-05-07T20:32:00.2815579Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.2815930Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.2816197Z E ^ 2025-05-07T20:32:00.2816649Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.2817095Z 2025-05-07T20:32:00.2817501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.2818004Z 2025-05-07T20:32:00.2818115Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.2818516Z self=, 2025-05-07T20:32:00.2818912Z T=2048, 2025-05-07T20:32:00.2819100Z D=7168, 2025-05-07T20:32:00.2819297Z scale_ub=1200.0, 2025-05-07T20:32:00.2819511Z contiguous=True, 2025-05-07T20:32:00.2819739Z compiled=False, 2025-05-07T20:32:00.2819949Z ) 2025-05-07T20:32:00.2820263Z self = 2025-05-07T20:32:00.2820756Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.2821030Z 2025-05-07T20:32:00.2821114Z @given( 2025-05-07T20:32:00.2821341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.2821651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.2821957Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.2822278Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.2822601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.2822887Z ) 2025-05-07T20:32:00.2823232Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.2823664Z def test_silu_mul_quant( 2025-05-07T20:32:00.2823904Z self, 2025-05-07T20:32:00.2824098Z T: int, 2025-05-07T20:32:00.2824290Z D: int, 2025-05-07T20:32:00.2824505Z scale_ub: Optional[float], 2025-05-07T20:32:00.2824774Z contiguous: bool, 2025-05-07T20:32:00.2825004Z compiled: bool, 2025-05-07T20:32:00.2825227Z ) -> None: 2025-05-07T20:32:00.2825440Z torch.manual_seed(2025) 2025-05-07T20:32:00.2825673Z 2025-05-07T20:32:00.2825940Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.2827959Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.2829770Z 2025-05-07T20:32:00.2829887Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.2830102Z 2025-05-07T20:32:00.2830205Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.2830705Z self=, 2025-05-07T20:32:00.2831100Z T=1, 2025-05-07T20:32:00.2831289Z D=5120, 2025-05-07T20:32:00.2831486Z scale_ub=1200.0, 2025-05-07T20:32:00.2831700Z contiguous=True, 2025-05-07T20:32:00.2831920Z compiled=False, 2025-05-07T20:32:00.2832127Z ) 2025-05-07T20:32:00.2832438Z self = 2025-05-07T20:32:00.2832920Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.2833180Z 2025-05-07T20:32:00.2833264Z @given( 2025-05-07T20:32:00.2833494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.2833886Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.2834191Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.2834520Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.2834842Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.2835133Z ) 2025-05-07T20:32:00.2835477Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.2835912Z def test_silu_mul_quant( 2025-05-07T20:32:00.2836154Z self, 2025-05-07T20:32:00.2836348Z T: int, 2025-05-07T20:32:00.2836537Z D: int, 2025-05-07T20:32:00.2836753Z scale_ub: Optional[float], 2025-05-07T20:32:00.2837020Z contiguous: bool, 2025-05-07T20:32:00.2837249Z compiled: bool, 2025-05-07T20:32:00.2837470Z ) -> None: 2025-05-07T20:32:00.2837682Z torch.manual_seed(2025) 2025-05-07T20:32:00.2837916Z 2025-05-07T20:32:00.2838184Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.2838527Z 2025-05-07T20:32:00.2838716Z x_sign = torch.sign(x) 2025-05-07T20:32:00.2839005Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.2839314Z x = x_sign * x_clamp 2025-05-07T20:32:00.2839560Z x0 = x[:, :D] 2025-05-07T20:32:00.2839768Z x1 = x[:, D:] 2025-05-07T20:32:00.2839977Z 2025-05-07T20:32:00.2840164Z if contiguous: 2025-05-07T20:32:00.2840386Z x0 = x0.contiguous() 2025-05-07T20:32:00.2840642Z x1 = x1.contiguous() 2025-05-07T20:32:00.2840882Z 2025-05-07T20:32:00.2841068Z if scale_ub is not None: 2025-05-07T20:32:00.2841338Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.2841669Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.2841980Z ) 2025-05-07T20:32:00.2842180Z else: 2025-05-07T20:32:00.2842385Z scale_ub_tensor = None 2025-05-07T20:32:00.2842636Z 2025-05-07T20:32:00.2842872Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.2843183Z op = silu_mul_quant 2025-05-07T20:32:00.2843435Z if compiled: 2025-05-07T20:32:00.2843687Z op = torch.compile(op) 2025-05-07T20:32:00.2843979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2844252Z 2025-05-07T20:32:00.2844444Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.2844604Z 2025-05-07T20:32:00.2844701Z moe/activation_test.py:117: 2025-05-07T20:32:00.2844997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2845326Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.2845598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2846280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.2846964Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.2847502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.2848168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.2848821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.2849468Z kernel = self.compile( 2025-05-07T20:32:00.2850003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.2850647Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.2851039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2851262Z 2025-05-07T20:32:00.2851471Z self = 2025-05-07T20:32:00.2852614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.2854055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93973fb6a0>} 2025-05-07T20:32:00.2855375Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.2856379Z context = 2025-05-07T20:32:00.2856659Z 2025-05-07T20:32:00.2856829Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.2857333Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.2857794Z module_map=module_map) 2025-05-07T20:32:00.2858165Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.2858517Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.2858772Z E ^ 2025-05-07T20:32:00.2859230Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.2859676Z 2025-05-07T20:32:00.2860091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.3687469Z 2025-05-07T20:32:00.3687952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.3688883Z self=, 2025-05-07T20:32:00.3689678Z T=2048, 2025-05-07T20:32:00.3690055Z D=5120, 2025-05-07T20:32:00.3690433Z scale_ub=None, 2025-05-07T20:32:00.3690855Z contiguous=True, 2025-05-07T20:32:00.3691289Z compiled=False, 2025-05-07T20:32:00.3691689Z ) 2025-05-07T20:32:00.3692338Z self = 2025-05-07T20:32:00.3693306Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.3694010Z 2025-05-07T20:32:00.3694168Z @given( 2025-05-07T20:32:00.3694677Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.3695113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.3695427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.3695757Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.3696076Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.3696369Z ) 2025-05-07T20:32:00.3696717Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.3697161Z def test_silu_mul_quant( 2025-05-07T20:32:00.3697398Z self, 2025-05-07T20:32:00.3697605Z T: int, 2025-05-07T20:32:00.3697807Z D: int, 2025-05-07T20:32:00.3698021Z scale_ub: Optional[float], 2025-05-07T20:32:00.3698646Z contiguous: bool, 2025-05-07T20:32:00.3698895Z compiled: bool, 2025-05-07T20:32:00.3699113Z ) -> None: 2025-05-07T20:32:00.3699329Z torch.manual_seed(2025) 2025-05-07T20:32:00.3699570Z 2025-05-07T20:32:00.3700132Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.3700472Z 2025-05-07T20:32:00.3700663Z > x_sign = torch.sign(x) 2025-05-07T20:32:00.3702570Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
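For context on what is being exercised: the log only reveals that silu_mul_quant takes the two halves of a gated activation plus an optional scale upper bound and returns a (y_fp8, y_scale) pair. A rough pure-PyTorch sketch of that contract follows; the rowwise scaling and clamping details are assumptions for illustration, not taken from the fused Triton kernel:

    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # SiLU-gate then multiply, in fp32 for accuracy.
        y = F.silu(x0.float()) * x1.float()
        # Hypothetical per-row scales, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(-1)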
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.3704521Z 2025-05-07T20:32:00.3704648Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:00.3704856Z 2025-05-07T20:32:00.3704958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.3705363Z self=, 2025-05-07T20:32:00.3705766Z T=16384, 2025-05-07T20:32:00.3705954Z D=5120, 2025-05-07T20:32:00.3706146Z scale_ub=None, 2025-05-07T20:32:00.3706358Z contiguous=True, 2025-05-07T20:32:00.3706572Z compiled=False, 2025-05-07T20:32:00.3706777Z ) 2025-05-07T20:32:00.3707093Z self = 2025-05-07T20:32:00.3707573Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.3707848Z 2025-05-07T20:32:00.3707926Z @given( 2025-05-07T20:32:00.3708158Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.3708471Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.3708778Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.3709104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.3709428Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.3709717Z ) 2025-05-07T20:32:00.3710064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.3710506Z def test_silu_mul_quant( 2025-05-07T20:32:00.3710737Z self, 2025-05-07T20:32:00.3710931Z T: int, 2025-05-07T20:32:00.3711131Z D: int, 2025-05-07T20:32:00.3711346Z scale_ub: Optional[float], 2025-05-07T20:32:00.3711617Z contiguous: bool, 2025-05-07T20:32:00.3711860Z compiled: bool, 2025-05-07T20:32:00.3712080Z ) -> None: 2025-05-07T20:32:00.3712298Z torch.manual_seed(2025) 2025-05-07T20:32:00.3712541Z 2025-05-07T20:32:00.3712812Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.3714802Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.3716663Z 2025-05-07T20:32:00.3716778Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.3716993Z 2025-05-07T20:32:00.3717095Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.3717498Z self=, 2025-05-07T20:32:00.3717895Z T=4096, 2025-05-07T20:32:00.3718078Z D=5120, 2025-05-07T20:32:00.3718270Z scale_ub=None, 2025-05-07T20:32:00.3718489Z contiguous=True, 2025-05-07T20:32:00.3718706Z compiled=False, 2025-05-07T20:32:00.3718908Z ) 2025-05-07T20:32:00.3719223Z self = 2025-05-07T20:32:00.3719702Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.3720063Z 2025-05-07T20:32:00.3720140Z @given( 2025-05-07T20:32:00.3720368Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.3720670Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.3720973Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.3721296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.3721618Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.3721897Z ) 2025-05-07T20:32:00.3722239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.3722677Z def test_silu_mul_quant( 2025-05-07T20:32:00.3722906Z self, 2025-05-07T20:32:00.3723179Z T: int, 2025-05-07T20:32:00.3723374Z D: int, 2025-05-07T20:32:00.3723582Z scale_ub: Optional[float], 2025-05-07T20:32:00.3723854Z contiguous: bool, 2025-05-07T20:32:00.3724089Z compiled: bool, 2025-05-07T20:32:00.3724309Z ) -> None: 2025-05-07T20:32:00.3724521Z torch.manual_seed(2025) 2025-05-07T20:32:00.3724762Z 2025-05-07T20:32:00.3725039Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.3727053Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.3728861Z 2025-05-07T20:32:00.3728977Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.3729192Z 2025-05-07T20:32:00.3729293Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.3729703Z self=, 2025-05-07T20:32:00.3730093Z T=2048, 2025-05-07T20:32:00.3730280Z D=5120, 2025-05-07T20:32:00.3730468Z scale_ub=None, 2025-05-07T20:32:00.3730675Z contiguous=False, 2025-05-07T20:32:00.3730902Z compiled=False, 2025-05-07T20:32:00.3731108Z ) 2025-05-07T20:32:00.3731417Z self = 2025-05-07T20:32:00.3731925Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:00.3732193Z 2025-05-07T20:32:00.3732273Z @given( 2025-05-07T20:32:00.3732503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.3732819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.3733117Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.3733443Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.3733865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.3734149Z ) 2025-05-07T20:32:00.3734485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.3734924Z def test_silu_mul_quant( 2025-05-07T20:32:00.3745002Z self, 2025-05-07T20:32:00.3745244Z T: int, 2025-05-07T20:32:00.3745452Z D: int, 2025-05-07T20:32:00.3745669Z scale_ub: Optional[float], 2025-05-07T20:32:00.3745947Z contiguous: bool, 2025-05-07T20:32:00.3746192Z compiled: bool, 2025-05-07T20:32:00.3746415Z ) -> None: 2025-05-07T20:32:00.3746642Z torch.manual_seed(2025) 2025-05-07T20:32:00.3746889Z 2025-05-07T20:32:00.3747169Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.3749185Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.3751124Z 2025-05-07T20:32:00.3751245Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.3751457Z 2025-05-07T20:32:00.3751570Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.3751982Z self=, 2025-05-07T20:32:00.3752377Z T=4096, 2025-05-07T20:32:00.3752575Z D=7168, 2025-05-07T20:32:00.3752859Z scale_ub=None, 2025-05-07T20:32:00.3753074Z contiguous=True, 2025-05-07T20:32:00.3753306Z compiled=True, 2025-05-07T20:32:00.3753524Z ) 2025-05-07T20:32:00.3753840Z self = 2025-05-07T20:32:00.3754346Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:00.3754611Z 2025-05-07T20:32:00.3754704Z @given( 2025-05-07T20:32:00.3754933Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.3755251Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.3755558Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.3755894Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.3756216Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.3756511Z ) 2025-05-07T20:32:00.3756867Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.3757310Z def test_silu_mul_quant( 2025-05-07T20:32:00.3757557Z self, 2025-05-07T20:32:00.3757756Z T: int, 2025-05-07T20:32:00.3757951Z D: int, 2025-05-07T20:32:00.3758176Z scale_ub: Optional[float], 2025-05-07T20:32:00.3758458Z contiguous: bool, 2025-05-07T20:32:00.3758694Z compiled: bool, 2025-05-07T20:32:00.3758919Z ) -> None: 2025-05-07T20:32:00.3759138Z torch.manual_seed(2025) 2025-05-07T20:32:00.3759374Z 2025-05-07T20:32:00.3759647Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.3761648Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.3763460Z 2025-05-07T20:32:00.3763577Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.3763792Z 2025-05-07T20:32:00.3763900Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.3764298Z self=, 2025-05-07T20:32:00.3764701Z T=2048, 2025-05-07T20:32:00.3764885Z D=5120, 2025-05-07T20:32:00.3765104Z scale_ub=1200.0, 2025-05-07T20:32:00.3765351Z contiguous=False, 2025-05-07T20:32:00.3765571Z compiled=False, 2025-05-07T20:32:00.4310569Z ) 2025-05-07T20:32:00.4311071Z self = 2025-05-07T20:32:00.4311761Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:00.4312120Z 2025-05-07T20:32:00.4312208Z @given( 2025-05-07T20:32:00.4312453Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.4312773Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.4313090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.4313687Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.4314052Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.4314444Z ) 2025-05-07T20:32:00.4314837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.4315278Z def test_silu_mul_quant( 2025-05-07T20:32:00.4315529Z self, 2025-05-07T20:32:00.4315734Z T: int, 2025-05-07T20:32:00.4315931Z D: int, 2025-05-07T20:32:00.4316156Z scale_ub: Optional[float], 2025-05-07T20:32:00.4316435Z contiguous: bool, 2025-05-07T20:32:00.4316676Z compiled: bool, 2025-05-07T20:32:00.4316918Z ) -> None: 2025-05-07T20:32:00.4317142Z torch.manual_seed(2025) 2025-05-07T20:32:00.4317386Z 2025-05-07T20:32:00.4317809Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.4319828Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.4321654Z 2025-05-07T20:32:00.4321777Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.4321989Z 2025-05-07T20:32:00.4322104Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.4322514Z self=, 2025-05-07T20:32:00.4322923Z T=4096, 2025-05-07T20:32:00.4323120Z D=7168, 2025-05-07T20:32:00.4323322Z scale_ub=1200.0, 2025-05-07T20:32:00.4323552Z contiguous=True, 2025-05-07T20:32:00.4323783Z compiled=False, 2025-05-07T20:32:00.4323996Z ) 2025-05-07T20:32:00.4324320Z self = 2025-05-07T20:32:00.4324814Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.4325090Z 2025-05-07T20:32:00.4325180Z @given( 2025-05-07T20:32:00.4325409Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.4325727Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.4326034Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.4326361Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.4326693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.4326984Z ) 2025-05-07T20:32:00.4327332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.4327778Z def test_silu_mul_quant( 2025-05-07T20:32:00.4328023Z self, 2025-05-07T20:32:00.4328224Z T: int, 2025-05-07T20:32:00.4328422Z D: int, 2025-05-07T20:32:00.4328653Z scale_ub: Optional[float], 2025-05-07T20:32:00.4328929Z contiguous: bool, 2025-05-07T20:32:00.4329166Z compiled: bool, 2025-05-07T20:32:00.4329396Z ) -> None: 2025-05-07T20:32:00.4329617Z torch.manual_seed(2025) 2025-05-07T20:32:00.4329858Z 2025-05-07T20:32:00.4330130Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.4332135Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.4334610Z 2025-05-07T20:32:00.4334737Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.4334948Z 2025-05-07T20:32:00.4335059Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.4335463Z self=, 2025-05-07T20:32:00.4335865Z T=16384, 2025-05-07T20:32:00.4336067Z D=7168, 2025-05-07T20:32:00.4336264Z scale_ub=None, 2025-05-07T20:32:00.4336487Z contiguous=False, 2025-05-07T20:32:00.4336717Z compiled=True, 2025-05-07T20:32:00.4336920Z ) 2025-05-07T20:32:00.4337242Z self = 2025-05-07T20:32:00.4337737Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:00.4338093Z 2025-05-07T20:32:00.4338175Z @given( 2025-05-07T20:32:00.4338413Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.4338733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.4339051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.4339378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.4339711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.4340004Z ) 2025-05-07T20:32:00.4340350Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.4340798Z def test_silu_mul_quant( 2025-05-07T20:32:00.4341045Z self, 2025-05-07T20:32:00.4341242Z T: int, 2025-05-07T20:32:00.4341445Z D: int, 2025-05-07T20:32:00.4341675Z scale_ub: Optional[float], 2025-05-07T20:32:00.4341951Z contiguous: bool, 2025-05-07T20:32:00.4342197Z compiled: bool, 2025-05-07T20:32:00.4342427Z ) -> None: 2025-05-07T20:32:00.4342647Z torch.manual_seed(2025) 2025-05-07T20:32:00.4342891Z 2025-05-07T20:32:00.4343164Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.4345181Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.4384063Z 2025-05-07T20:32:00.4384179Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.6187457Z 2025-05-07T20:32:00.6188161Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.6188805Z self=, 2025-05-07T20:32:00.6189342Z T=128, 2025-05-07T20:32:00.6189591Z D=5120, 2025-05-07T20:32:00.6189843Z scale_ub=1200.0, 2025-05-07T20:32:00.6190132Z contiguous=False, 2025-05-07T20:32:00.6190359Z compiled=False, 2025-05-07T20:32:00.6190571Z ) 2025-05-07T20:32:00.6190909Z self = 2025-05-07T20:32:00.6191408Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:00.6191683Z 2025-05-07T20:32:00.6191762Z @given( 2025-05-07T20:32:00.6192005Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.6192311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.6192621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.6192950Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.6193268Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.6193554Z ) 2025-05-07T20:32:00.6193904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.6194337Z def test_silu_mul_quant( 2025-05-07T20:32:00.6194577Z self, 2025-05-07T20:32:00.6194770Z T: int, 2025-05-07T20:32:00.6194970Z D: int, 2025-05-07T20:32:00.6195181Z scale_ub: Optional[float], 2025-05-07T20:32:00.6195462Z contiguous: bool, 2025-05-07T20:32:00.6195702Z compiled: bool, 2025-05-07T20:32:00.6195925Z ) -> None: 2025-05-07T20:32:00.6196142Z torch.manual_seed(2025) 2025-05-07T20:32:00.6196385Z 2025-05-07T20:32:00.6197048Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.6197389Z 2025-05-07T20:32:00.6197589Z x_sign = torch.sign(x) 2025-05-07T20:32:00.6197870Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.6198436Z x = x_sign * x_clamp 2025-05-07T20:32:00.6198709Z x0 = x[:, :D] 2025-05-07T20:32:00.6198924Z x1 = x[:, D:] 2025-05-07T20:32:00.6199127Z 2025-05-07T20:32:00.6199305Z if contiguous: 2025-05-07T20:32:00.6199538Z x0 = x0.contiguous() 2025-05-07T20:32:00.6199796Z x1 = x1.contiguous() 2025-05-07T20:32:00.6200029Z 2025-05-07T20:32:00.6200221Z if scale_ub is not None: 2025-05-07T20:32:00.6200655Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.6200985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.6201293Z ) 2025-05-07T20:32:00.6201487Z else: 2025-05-07T20:32:00.6201705Z scale_ub_tensor = None 2025-05-07T20:32:00.6201955Z 2025-05-07T20:32:00.6202185Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.6202494Z op = silu_mul_quant 2025-05-07T20:32:00.6202742Z if compiled: 2025-05-07T20:32:00.6202984Z op = torch.compile(op) 2025-05-07T20:32:00.6203277Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.6203544Z 2025-05-07T20:32:00.6203736Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.6203900Z 2025-05-07T20:32:00.6204002Z moe/activation_test.py:117: 2025-05-07T20:32:00.6204286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.6204619Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.6204910Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.6205585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.6206275Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.6206808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.6207477Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.6208125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.6208652Z kernel = self.compile( 2025-05-07T20:32:00.6209189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.6209836Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.6210229Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.6210458Z 2025-05-07T20:32:00.6210661Z self = 2025-05-07T20:32:00.6211719Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.6213080Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397275bc0>} 2025-05-07T20:32:00.6214574Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.6215581Z context = 2025-05-07T20:32:00.6215874Z 2025-05-07T20:32:00.6216037Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.6216551Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.6217146Z module_map=module_map) 2025-05-07T20:32:00.6217508Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.6217856Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.6218110Z E ^ 2025-05-07T20:32:00.6218565Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.6219007Z 2025-05-07T20:32:00.6219412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.6219913Z 2025-05-07T20:32:00.6220021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.6220501Z self=, 2025-05-07T20:32:00.6220902Z T=2048, 2025-05-07T20:32:00.6221092Z D=7168, 2025-05-07T20:32:00.6221278Z scale_ub=None, 2025-05-07T20:32:00.6221491Z contiguous=False, 2025-05-07T20:32:00.6221718Z compiled=False, 2025-05-07T20:32:00.6221922Z ) 2025-05-07T20:32:00.6222243Z self = 2025-05-07T20:32:00.6222730Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:00.6222996Z 2025-05-07T20:32:00.6223079Z @given( 2025-05-07T20:32:00.6223300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.6223610Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.6223916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.6224237Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.6224560Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.6224839Z ) 2025-05-07T20:32:00.6225183Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.6225619Z def test_silu_mul_quant( 2025-05-07T20:32:00.6225855Z self, 2025-05-07T20:32:00.6226046Z T: int, 2025-05-07T20:32:00.6226241Z D: int, 2025-05-07T20:32:00.6226455Z scale_ub: Optional[float], 2025-05-07T20:32:00.6226723Z contiguous: bool, 2025-05-07T20:32:00.6226959Z compiled: bool, 2025-05-07T20:32:00.6227180Z ) -> None: 2025-05-07T20:32:00.6227392Z torch.manual_seed(2025) 2025-05-07T20:32:00.6227624Z 2025-05-07T20:32:00.6227888Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.6229895Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
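As the error message itself suggests, the caching allocator can be switched to expandable segments; a minimal sketch of that workaround (the variable must be in the environment before PyTorch creates its CUDA context):

    import os
    # Must be set before the first CUDA allocation in the process.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch
    x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)

Note, though, that this run reports 21.73 GiB allocated by PyTorch with only ~14 MiB reserved but unallocated, so the device looks genuinely exhausted rather than fragmented; releasing tensors between Hypothesis examples (dropping references, then torch.cuda.empty_cache()) is likely the more relevant fix here.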
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.6231704Z 2025-05-07T20:32:00.6231826Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.6232034Z 2025-05-07T20:32:00.6232138Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.6232534Z self=, 2025-05-07T20:32:00.6232932Z T=128, 2025-05-07T20:32:00.6233120Z D=7168, 2025-05-07T20:32:00.6233309Z scale_ub=1200.0, 2025-05-07T20:32:00.6233529Z contiguous=True, 2025-05-07T20:32:00.6233746Z compiled=True, 2025-05-07T20:32:00.6233944Z ) 2025-05-07T20:32:00.6234259Z self = 2025-05-07T20:32:00.6234758Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:00.6235023Z 2025-05-07T20:32:00.6235101Z @given( 2025-05-07T20:32:00.6235330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.6235631Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.6236052Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.6236385Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.6236703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.6236993Z ) 2025-05-07T20:32:00.6237337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.6237774Z def test_silu_mul_quant( 2025-05-07T20:32:00.6238005Z self, 2025-05-07T20:32:00.6238205Z T: int, 2025-05-07T20:32:00.6238407Z D: int, 2025-05-07T20:32:00.6248758Z scale_ub: Optional[float], 2025-05-07T20:32:00.6249169Z contiguous: bool, 2025-05-07T20:32:00.6249419Z compiled: bool, 2025-05-07T20:32:00.6249769Z ) -> None: 2025-05-07T20:32:00.6249988Z torch.manual_seed(2025) 2025-05-07T20:32:00.6250228Z 2025-05-07T20:32:00.6250493Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.6250844Z 2025-05-07T20:32:00.6251039Z x_sign = torch.sign(x) 2025-05-07T20:32:00.6251322Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.6251627Z x = x_sign * x_clamp 2025-05-07T20:32:00.6251864Z x0 = x[:, :D] 2025-05-07T20:32:00.6252067Z x1 = x[:, D:] 2025-05-07T20:32:00.6252280Z 2025-05-07T20:32:00.6252461Z if contiguous: 2025-05-07T20:32:00.6252690Z x0 = x0.contiguous() 2025-05-07T20:32:00.6252935Z x1 = x1.contiguous() 2025-05-07T20:32:00.6253174Z 2025-05-07T20:32:00.6253371Z if scale_ub is not None: 2025-05-07T20:32:00.6253701Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.6254051Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.6254367Z ) 2025-05-07T20:32:00.6254557Z else: 2025-05-07T20:32:00.6254769Z scale_ub_tensor = None 2025-05-07T20:32:00.6255023Z 2025-05-07T20:32:00.6255243Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.6255573Z op = silu_mul_quant 2025-05-07T20:32:00.6255822Z if compiled: 2025-05-07T20:32:00.6256064Z op = torch.compile(op) 2025-05-07T20:32:00.6256359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.6256636Z 2025-05-07T20:32:00.6256820Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.6256988Z 2025-05-07T20:32:00.6257084Z moe/activation_test.py:117: 2025-05-07T20:32:00.6257376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.6257711Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.6257987Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.6258550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:00.6259107Z return fn(*args, **kwargs) 
2025-05-07T20:32:00.6259750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.6260428Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.6260957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.6261625Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.6262270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.6262795Z kernel = self.compile( 2025-05-07T20:32:00.6263326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.6263969Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.6264358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.6264581Z 2025-05-07T20:32:00.6264781Z self = 2025-05-07T20:32:00.6265931Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.6267271Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93971742c0>} 2025-05-07T20:32:00.6268587Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.6269648Z context = 2025-05-07T20:32:00.6269923Z 2025-05-07T20:32:00.6270085Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.6270591Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.6271049Z module_map=module_map) 2025-05-07T20:32:00.6271409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.6271746Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.6272004Z E ^ 2025-05-07T20:32:00.6272455Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.6272889Z 2025-05-07T20:32:00.6273295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2247238Z 2025-05-07T20:32:01.2247694Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2248584Z self=, 2025-05-07T20:32:01.2249448Z T=128, 2025-05-07T20:32:01.2249839Z D=7168, 2025-05-07T20:32:01.2250236Z scale_ub=1200.0, 2025-05-07T20:32:01.2250697Z contiguous=True, 2025-05-07T20:32:01.2251145Z compiled=False, 2025-05-07T20:32:01.2251571Z ) 2025-05-07T20:32:01.2252209Z self = 2025-05-07T20:32:01.2253195Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.2253906Z 2025-05-07T20:32:01.2254082Z @given( 2025-05-07T20:32:01.2254549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2255163Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2255514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2255862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2256187Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2256486Z ) 2025-05-07T20:32:01.2256840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2257281Z def test_silu_mul_quant( 2025-05-07T20:32:01.2257540Z self, 2025-05-07T20:32:01.2257743Z T: int, 2025-05-07T20:32:01.2257941Z D: int, 2025-05-07T20:32:01.2258166Z scale_ub: Optional[float], 2025-05-07T20:32:01.2258442Z contiguous: bool, 2025-05-07T20:32:01.2258682Z compiled: bool, 2025-05-07T20:32:01.2258917Z ) -> None: 2025-05-07T20:32:01.2259142Z torch.manual_seed(2025) 2025-05-07T20:32:01.2259382Z 2025-05-07T20:32:01.2259663Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2260015Z 2025-05-07T20:32:01.2260211Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2260506Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2262482Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
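The CompilationError blocks above ("type fp8e4nv not supported in this architecture") are Triton rejecting the e4m3 fp8 dtype on this runner's GPU: a g5.4xlarge carries an A10G (sm_86), while Triton's fp8e4nv path needs a newer compute capability, which is why only 'fp8e4b15' and 'fp8e5' are offered. A hedged sketch of a capability guard such a test could use (the helper name and the 8.9 threshold are assumptions, not FBGEMM API):

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (float8 e4m3) is assumed to require sm_89+ (Ada/Hopper);
        # the A10G on this runner reports (8, 6), hence the Triton error above.
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Applied on top of the existing @given/@settings decorators, e.g.:
    # @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires sm_89+")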
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2264510Z 2025-05-07T20:32:01.2264633Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.2264850Z 2025-05-07T20:32:01.2264958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2265365Z self=, 2025-05-07T20:32:01.2265761Z T=128, 2025-05-07T20:32:01.2265955Z D=5120, 2025-05-07T20:32:01.2266156Z scale_ub=1200.0, 2025-05-07T20:32:01.2266380Z contiguous=True, 2025-05-07T20:32:01.2266739Z compiled=True, 2025-05-07T20:32:01.2266954Z ) 2025-05-07T20:32:01.2267269Z self = 2025-05-07T20:32:01.2267749Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.2268023Z 2025-05-07T20:32:01.2268103Z @given( 2025-05-07T20:32:01.2268337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2268646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2268954Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2269285Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2269604Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2269890Z ) 2025-05-07T20:32:01.2270237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2270673Z def test_silu_mul_quant( 2025-05-07T20:32:01.2270917Z self, 2025-05-07T20:32:01.2271113Z T: int, 2025-05-07T20:32:01.2271313Z D: int, 2025-05-07T20:32:01.2271534Z scale_ub: Optional[float], 2025-05-07T20:32:01.2271805Z contiguous: bool, 2025-05-07T20:32:01.2272050Z compiled: bool, 2025-05-07T20:32:01.2272276Z ) -> None: 2025-05-07T20:32:01.2272499Z torch.manual_seed(2025) 2025-05-07T20:32:01.2272741Z 2025-05-07T20:32:01.2273008Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2273345Z 2025-05-07T20:32:01.2273543Z > x_sign = torch.sign(x) 2025-05-07T20:32:01.2275431Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2277243Z 2025-05-07T20:32:01.2277364Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:01.2277587Z 2025-05-07T20:32:01.2277689Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2278100Z self=, 2025-05-07T20:32:01.2278501Z T=128, 2025-05-07T20:32:01.2278688Z D=7168, 2025-05-07T20:32:01.2278881Z scale_ub=None, 2025-05-07T20:32:01.2279098Z contiguous=True, 2025-05-07T20:32:01.2279319Z compiled=True, 2025-05-07T20:32:01.2279521Z ) 2025-05-07T20:32:01.2279837Z self = 2025-05-07T20:32:01.2280312Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.2280582Z 2025-05-07T20:32:01.2280662Z @given( 2025-05-07T20:32:01.2280901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2281213Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2281522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2281848Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2282263Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2282539Z ) 2025-05-07T20:32:01.2282882Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2283319Z def test_silu_mul_quant( 2025-05-07T20:32:01.2283552Z self, 2025-05-07T20:32:01.2283749Z T: int, 2025-05-07T20:32:01.2283947Z D: int, 2025-05-07T20:32:01.2284165Z scale_ub: Optional[float], 2025-05-07T20:32:01.2284436Z contiguous: bool, 2025-05-07T20:32:01.2284678Z compiled: bool, 2025-05-07T20:32:01.2284894Z ) -> None: 2025-05-07T20:32:01.2285110Z torch.manual_seed(2025) 2025-05-07T20:32:01.2285355Z 2025-05-07T20:32:01.2285699Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2287688Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2289498Z 2025-05-07T20:32:01.2289617Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.2289830Z 2025-05-07T20:32:01.2337692Z FAILED 2025-05-07T20:32:01.2338040Z 2025-05-07T20:32:01.2338505Z =================================== FAILURES =================================== 2025-05-07T20:32:01.2339118Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:01.2339734Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:01.2340574Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:01.2341321Z | yield 2025-05-07T20:32:01.2341903Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:01.2342615Z | self._callTestMethod(testMethod) 2025-05-07T20:32:01.2343012Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:01.2343748Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:01.2344487Z | if method() is not None: 2025-05-07T20:32:01.2344824Z | ~~~~~~^^ 2025-05-07T20:32:01.2345695Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:01.2346716Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2347114Z | ^^^^^^^ 2025-05-07T20:32:01.2347886Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:01.2348756Z | raise the_error_hypothesis_found 2025-05-07T20:32:01.2349320Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:01.2349894Z +-+---------------- 1 ---------------- 2025-05-07T20:32:01.2350301Z | Traceback (most recent call last): 2025-05-07T20:32:01.2351281Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:01.2352337Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2355182Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2357876Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:01.2358307Z | self=, 2025-05-07T20:32:01.2358709Z | T=128, 2025-05-07T20:32:01.2358901Z | D=7168, 2025-05-07T20:32:01.2359110Z | scale_ub=1200.0, 2025-05-07T20:32:01.2359668Z | contiguous=True, 2025-05-07T20:32:01.2359903Z | compiled=False, 2025-05-07T20:32:01.2360132Z | ) 2025-05-07T20:32:01.2360464Z | 2025-05-07T20:32:01.2360987Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:32:01.2361581Z +---------------- 2 ---------------- 2025-05-07T20:32:01.2361874Z | Traceback (most recent call last): 2025-05-07T20:32:01.2362564Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:01.2363322Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2365312Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2367214Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:01.2367645Z | self=, 2025-05-07T20:32:01.2368046Z | T=128, 2025-05-07T20:32:01.2368237Z | D=7168, 2025-05-07T20:32:01.2368445Z | scale_ub=None, 2025-05-07T20:32:01.2368682Z | contiguous=True, 2025-05-07T20:32:01.2368912Z | compiled=True, 2025-05-07T20:32:01.2369131Z | ) 2025-05-07T20:32:01.2369309Z | 2025-05-07T20:32:01.2369812Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:01.2370401Z +---------------- 3 ---------------- 2025-05-07T20:32:01.2370694Z | Traceback (most recent call last): 2025-05-07T20:32:01.2371388Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:01.2372135Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2374282Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
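Each falsifying example above comes with a Hypothesis blob for deterministic replay. A sketch of how the decorator from sub-exception 1 would be applied, with the version and blob copied verbatim from the log (it sits temporarily on top of the existing @given test, as the log instructs):

    from hypothesis import reproduce_failure

    # Replays exactly T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False.
    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...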
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2376540Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:01.2377142Z | self=, 2025-05-07T20:32:01.2377685Z | T=128, 2025-05-07T20:32:01.2377946Z | D=5120, 2025-05-07T20:32:01.2378230Z | scale_ub=1200.0, 2025-05-07T20:32:01.2378675Z | contiguous=True, 2025-05-07T20:32:01.2378990Z | compiled=True, 2025-05-07T20:32:01.2379300Z | ) 2025-05-07T20:32:01.2379540Z | 2025-05-07T20:32:01.2380227Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:01.2381044Z +---------------- 4 ---------------- 2025-05-07T20:32:01.2381438Z | Traceback (most recent call last): 2025-05-07T20:32:01.2382402Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:01.2383371Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:01.2383852Z | ~~~~~~^^ 2025-05-07T20:32:01.2384723Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:01.2385652Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.2386772Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:01.2387850Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.2388228Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:32:01.2388568Z | a, 2025-05-07T20:32:01.2388835Z | ^^ 2025-05-07T20:32:01.2389116Z | ...<23 lines>... 
2025-05-07T20:32:01.2389436Z | USE_INT64=use_int64, 2025-05-07T20:32:01.2389791Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:01.2390118Z | ) 2025-05-07T20:32:01.2390356Z | ^ 2025-05-07T20:32:01.2391054Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:01.2392061Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2392672Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:01.2393531Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:01.2394573Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.2395211Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:01.2396070Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:01.2396966Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.2397352Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:01.2397955Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:01.2399104Z | fn() 2025-05-07T20:32:01.2399302Z | ~~^^ 2025-05-07T20:32:01.2399863Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:01.2400490Z | self.fn.run( 2025-05-07T20:32:01.2400738Z | ~~~~~~~~~~~^ 2025-05-07T20:32:01.2401028Z | *args, 2025-05-07T20:32:01.2401243Z | ^^^^^^ 2025-05-07T20:32:01.2401455Z | **current, 2025-05-07T20:32:01.2401681Z | ^^^^^^^^^^ 2025-05-07T20:32:01.2401903Z | ) 2025-05-07T20:32:01.2402086Z | ^ 2025-05-07T20:32:01.2402590Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:01.2403165Z | kernel = self.compile( 2025-05-07T20:32:01.2403420Z | src, 2025-05-07T20:32:01.2403630Z | target=target, 2025-05-07T20:32:01.2404090Z | options=options.__dict__, 2025-05-07T20:32:01.2404361Z | ) 2025-05-07T20:32:01.2404896Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:01.2405647Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.2406347Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:01.2407120Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.2407587Z | module_map=module_map) 2025-05-07T20:32:01.2408074Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.2408431Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.2408693Z | ^ 2025-05-07T20:32:01.2409149Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2409716Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:01.2410108Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:32:01.2410624Z | self=, 2025-05-07T20:32:01.2411064Z | T=1, # or any other generated value 2025-05-07T20:32:01.2411377Z | D=5120, # or any other generated value 2025-05-07T20:32:01.2411747Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:01.2412235Z | contiguous=True, # or any other generated value 2025-05-07T20:32:01.2412726Z | compiled=True, # or any other generated value 2025-05-07T20:32:01.2413129Z | ) 2025-05-07T20:32:01.2413373Z | 2025-05-07T20:32:01.2414196Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:01.2415016Z +------------------------------------ 2025-05-07T20:32:01.2415508Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:01.2416019Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2416599Z self=, 2025-05-07T20:32:01.2417149Z T=1, 2025-05-07T20:32:01.2417399Z D=5120, 2025-05-07T20:32:01.2417670Z scale_ub=None, 2025-05-07T20:32:01.2417973Z contiguous=True, 2025-05-07T20:32:01.2418277Z compiled=True, 2025-05-07T20:32:01.2418562Z ) 2025-05-07T20:32:01.2419000Z self = 2025-05-07T20:32:01.2419665Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.2420020Z 2025-05-07T20:32:01.2420130Z @given( 2025-05-07T20:32:01.2420450Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2420881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2444639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2445142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2445612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2446017Z ) 2025-05-07T20:32:01.2446490Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2447100Z def test_silu_mul_quant( 2025-05-07T20:32:01.2447435Z self, 2025-05-07T20:32:01.2447705Z T: int, 2025-05-07T20:32:01.2447972Z D: int, 2025-05-07T20:32:01.2448276Z scale_ub: Optional[float], 2025-05-07T20:32:01.2448654Z contiguous: bool, 2025-05-07T20:32:01.2448983Z compiled: bool, 2025-05-07T20:32:01.2449312Z ) -> None: 2025-05-07T20:32:01.2449608Z torch.manual_seed(2025) 2025-05-07T20:32:01.2449919Z 2025-05-07T20:32:01.2450288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2450781Z 2025-05-07T20:32:01.2451285Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2451697Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2452129Z x = x_sign * x_clamp 2025-05-07T20:32:01.2452464Z x0 = x[:, :D] 2025-05-07T20:32:01.2452760Z x1 = x[:, D:] 2025-05-07T20:32:01.2453052Z 2025-05-07T20:32:01.2453310Z if contiguous: 2025-05-07T20:32:01.2453799Z x0 = x0.contiguous() 2025-05-07T20:32:01.2454168Z x1 = x1.contiguous() 2025-05-07T20:32:01.2454496Z 2025-05-07T20:32:01.2454763Z if scale_ub is not None: 2025-05-07T20:32:01.2455142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.2455608Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2456148Z ) 2025-05-07T20:32:01.2456427Z else: 2025-05-07T20:32:01.2456726Z scale_ub_tensor = None 2025-05-07T20:32:01.2457069Z 2025-05-07T20:32:01.2457398Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2457849Z op = silu_mul_quant 2025-05-07T20:32:01.2458187Z if compiled: 2025-05-07T20:32:01.2458528Z op = torch.compile(op) 2025-05-07T20:32:01.2458916Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2459276Z 2025-05-07T20:32:01.2459539Z 
y_fp8, y_scale = fn() 2025-05-07T20:32:01.2459907Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.2460277Z 2025-05-07T20:32:01.2460585Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2461020Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.2461409Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.2461835Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.2462299Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.2462708Z 2025-05-07T20:32:01.2462983Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:01.2463264Z 2025-05-07T20:32:01.2463414Z moe/activation_test.py:126: 2025-05-07T20:32:01.2463822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2464291Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.2464710Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.2465825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.2466834Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.2467527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.2468432Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2469306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.2470201Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.2471113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.2471916Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.2472668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.2473344Z fn() 2025-05-07T20:32:01.2473997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.2474771Z self.fn.run( 2025-05-07T20:32:01.2475389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.2476079Z kernel = self.compile( 2025-05-07T20:32:01.2476769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.2477723Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.2478237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2478534Z 2025-05-07T20:32:01.2478803Z self = 2025-05-07T20:32:01.2480202Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.2482089Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f93bd68d3a0>} 2025-05-07T20:32:01.2483823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.2485138Z context = 2025-05-07T20:32:01.2485499Z 2025-05-07T20:32:01.2485716Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.2486381Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.2486995Z module_map=module_map) 2025-05-07T20:32:01.2487465Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.2487924Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.2488282Z E ^ 2025-05-07T20:32:01.2488889Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2489464Z 2025-05-07T20:32:01.2489980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2490604Z 2025-05-07T20:32:01.2490735Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2491231Z self=, 2025-05-07T20:32:01.2491716Z T=2048, 2025-05-07T20:32:01.2491935Z D=5120, 2025-05-07T20:32:01.2492167Z scale_ub=1200.0, 2025-05-07T20:32:01.2492439Z contiguous=True, 2025-05-07T20:32:01.2492696Z compiled=False, 2025-05-07T20:32:01.2492943Z ) 2025-05-07T20:32:01.2493323Z self = 2025-05-07T20:32:01.2494063Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.2494388Z 2025-05-07T20:32:01.2494480Z @given( 2025-05-07T20:32:01.2494756Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2495168Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2495536Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2496172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2496607Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2496973Z ) 2025-05-07T20:32:01.2497410Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2497947Z def test_silu_mul_quant( 2025-05-07T20:32:01.2498513Z self, 2025-05-07T20:32:01.2498771Z T: int, 2025-05-07T20:32:01.2499005Z D: int, 2025-05-07T20:32:01.2499260Z scale_ub: Optional[float], 2025-05-07T20:32:01.2499577Z contiguous: bool, 2025-05-07T20:32:01.2499857Z compiled: bool, 2025-05-07T20:32:01.2500121Z ) -> None: 2025-05-07T20:32:01.2500366Z torch.manual_seed(2025) 2025-05-07T20:32:01.2500653Z 2025-05-07T20:32:01.2500977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2501396Z 2025-05-07T20:32:01.2501635Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2501977Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2502595Z x = x_sign * x_clamp 2025-05-07T20:32:01.2502886Z x0 = x[:, :D] 2025-05-07T20:32:01.2503147Z x1 = x[:, D:] 2025-05-07T20:32:01.2503393Z 2025-05-07T20:32:01.2503634Z if contiguous: 2025-05-07T20:32:01.2503933Z x0 = x0.contiguous() 2025-05-07T20:32:01.2504232Z x1 = x1.contiguous() 2025-05-07T20:32:01.2504515Z 2025-05-07T20:32:01.2504758Z if scale_ub is not None: 2025-05-07T20:32:01.2505117Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.2505551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2505931Z ) 2025-05-07T20:32:01.2506165Z else: 2025-05-07T20:32:01.2506410Z scale_ub_tensor = None 2025-05-07T20:32:01.2506869Z 2025-05-07T20:32:01.2507153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2507525Z op = silu_mul_quant 2025-05-07T20:32:01.2507843Z if compiled: 
2025-05-07T20:32:01.2508156Z op = torch.compile(op) 2025-05-07T20:32:01.2508508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2508845Z 2025-05-07T20:32:01.2509074Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.2509275Z 2025-05-07T20:32:01.2509391Z moe/activation_test.py:117: 2025-05-07T20:32:01.2509751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2510159Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.2510498Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2511326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.2512175Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.2512862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.2513701Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2514547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.2515238Z kernel = self.compile( 2025-05-07T20:32:01.2515892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.2516674Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.2517150Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2517439Z 2025-05-07T20:32:01.2517694Z self = 2025-05-07T20:32:01.2519075Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.2520830Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93ce57e200>} 2025-05-07T20:32:01.2522550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.2523862Z context = 2025-05-07T20:32:01.2524228Z 2025-05-07T20:32:01.2524446Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.2525115Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.2525739Z module_map=module_map) 2025-05-07T20:32:01.2526229Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.2526686Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.2527039Z E ^ 2025-05-07T20:32:01.2527749Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:01.2614819Z 
2025-05-07T20:32:01.2615223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.2615732Z [identical test source and traceback repeated for each of the following examples; every one fails with CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); only the example parameters and the failing Triton kernel differ]
2025-05-07T20:32:01.2615833Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn -> triton_quantize_fp8_row)
2025-05-07T20:32:01.2672651Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn -> silu_mul_quant)
2025-05-07T20:32:01.2689442Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn -> silu_mul_quant)
2025-05-07T20:32:01.2702478Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn -> triton_quantize_fp8_row)
2025-05-07T20:32:01.2718385Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn -> silu_mul_quant)
2025-05-07T20:32:01.2735786Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn -> silu_mul_quant)
2025-05-07T20:32:01.2752587Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn -> triton_quantize_fp8_row)
2025-05-07T20:32:01.2773864Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn -> triton_quantize_fp8_row)
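The architecture named in the error is the root cause: Triton's fp8e4nv is the CUDA E4M3 format, which (in the Triton build used here) is only enabled on NVIDIA GPUs of compute capability 8.9 (Ada) or newer; older parts expose only fp8e4b15 and fp8e5, exactly as the ValueError reports, so every kernel that writes a torch.float8_e4m3fn output fails at compile time and Hypothesis keeps re-drawing examples into the same failure. A minimal sketch of a capability guard such a test could use, with hypothetical helper names that are not part of the FBGEMM test suite:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (E4M3) needs sm_89 or newer; earlier GPUs only get
        # fp8e4b15 / fp8e5, which is what the ValueError above lists.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied as a marker, this turns the repeated CompilationError into a skip:
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )

With a marker like this on test_silu_mul_quant, a runner without E4M3 support would report one skip instead of one identical compilation failure per drawn example.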
2025-05-07T20:32:01.2802084Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() returns, then ref_fn() fails in triton_quantize_fp8_row (moe/activation_test.py:124 -> fp8_gemm.py:2370 -> _kernel_quantize_fp8_row[grid]):
2025-05-07T20:32:01.2817337Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:01.2817446Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:01.2817523Z E ^
2025-05-07T20:32:01.2817875Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2817888Z 2025-05-07T20:32:01.2818298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis kept drawing new examples for test_silu_mul_quant, and every retry below failed in the same Triton compile step; the test body is identical to the listing above in each retry, so only the drawn parameters and the first kernel reached are shown. Retries that fail in ref_fn (moe/activation_test.py:124) reach _kernel_quantize_fp8_row through triton_quantize_fp8_row (fp8_gemm.py:2370); retries that fail in fn (moe/activation_test.py:115) reach _fbgemm_silu_mul_quant through silu_mul_quant (activation.py:80).

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn(): _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn(): _kernel_quantize_fp8_row

Both retries ended in:
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
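The failure is an architecture mismatch rather than a bug in the kernel source: Triton's NVIDIA backend rejects the fp8e4nv (FP8 E4M3) dtype below compute capability 8.9, and the linux.g5.4xlarge.nvidia.gpu runner carries an A10G, which is SM 8.6, where only fp8e4b15 and fp8e5 are available. A minimal sketch of the capability probe involved, using standard PyTorch CUDA APIs (the helper name is illustrative, not part of FBGEMM):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ (Ada/Hopper).
        # On the A10G (SM 8.6) running this job it returns False.
        # Illustrative helper, not part of the FBGEMM API.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)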
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> fn(): _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> ref_fn(): _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> fn(): _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> fn(): _fbgemm_silu_mul_quant

All four retries ended in:
E   triton.compiler.errors.CompilationError: at 1:0:
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
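For orientation, both failing paths exercise the same contract: SiLU-and-multiply followed by row-wise FP8 quantization, i.e. y = x0 * sigmoid(x0) * x1 with each row scaled so its maximum magnitude fits the FP8 range, and the returned scale is the dequantization multiplier the test applies as y_fp8.to(torch.float32) * y_scale[:, None]. A minimal pure-PyTorch sketch of that contract follows; it is an illustration inferred from the test's ref_fn, not FBGEMM's triton_quantize_fp8_row implementation:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, as in the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Per-row scale: row max |value| mapped onto the fp8 e4m3 max (448),
        # optionally capping the row max at scale_ub first.
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale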
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2905443Z 2025-05-07T20:32:01.2905845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2905854Z 2025-05-07T20:32:01.2905955Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2906177Z self=, 2025-05-07T20:32:01.2906254Z T=128, 2025-05-07T20:32:01.2906333Z D=7168, 2025-05-07T20:32:01.2906415Z scale_ub=1200.0, 2025-05-07T20:32:01.2906501Z contiguous=False, 2025-05-07T20:32:01.2906595Z compiled=False, 2025-05-07T20:32:01.2906668Z ) 2025-05-07T20:32:01.2906880Z self = 2025-05-07T20:32:01.2907052Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.2907057Z 2025-05-07T20:32:01.2907133Z @given( 2025-05-07T20:32:01.2907257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2907361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2907474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2907598Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2907714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2907785Z ) 2025-05-07T20:32:01.2908029Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2908123Z def test_silu_mul_quant( 2025-05-07T20:32:01.2908196Z self, 2025-05-07T20:32:01.2908277Z T: int, 2025-05-07T20:32:01.2908352Z D: int, 2025-05-07T20:32:01.2908447Z scale_ub: Optional[float], 2025-05-07T20:32:01.2908543Z contiguous: bool, 2025-05-07T20:32:01.2908626Z compiled: bool, 2025-05-07T20:32:01.2908700Z ) -> None: 2025-05-07T20:32:01.2908797Z torch.manual_seed(2025) 2025-05-07T20:32:01.2908871Z 2025-05-07T20:32:01.2909046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2909117Z 2025-05-07T20:32:01.2909205Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2909335Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2909507Z x = x_sign * x_clamp 2025-05-07T20:32:01.2909585Z x0 = x[:, :D] 2025-05-07T20:32:01.2909669Z x1 = x[:, D:] 2025-05-07T20:32:01.2909740Z 2025-05-07T20:32:01.2909823Z if contiguous: 2025-05-07T20:32:01.2909918Z x0 = x0.contiguous() 2025-05-07T20:32:01.2910003Z x1 = x1.contiguous() 2025-05-07T20:32:01.2910073Z 2025-05-07T20:32:01.2910169Z if scale_ub is not None: 2025-05-07T20:32:01.2910272Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.2910408Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2910481Z ) 2025-05-07T20:32:01.2910557Z else: 2025-05-07T20:32:01.2910730Z scale_ub_tensor = None 2025-05-07T20:32:01.2910801Z 2025-05-07T20:32:01.2910926Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2911017Z op = silu_mul_quant 2025-05-07T20:32:01.2911098Z if compiled: 2025-05-07T20:32:01.2911199Z op = torch.compile(op) 2025-05-07T20:32:01.2911307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2911378Z 2025-05-07T20:32:01.2911465Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.2911476Z 2025-05-07T20:32:01.2911569Z moe/activation_test.py:117: 2025-05-07T20:32:01.2911694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2911794Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.2911889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2912375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.2912481Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.2912832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.2913049Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2913392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.2913479Z kernel = self.compile( 2025-05-07T20:32:01.2913859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.2914025Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.2914148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2914152Z 2025-05-07T20:32:01.2914357Z self = 2025-05-07T20:32:01.2915117Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.2915614Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a5c30220>} 2025-05-07T20:32:01.2916346Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.2916537Z context = 2025-05-07T20:32:01.2916541Z 2025-05-07T20:32:01.2916702Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.2916956Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.2917073Z module_map=module_map) 2025-05-07T20:32:01.2917231Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.2917328Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.2917491Z E ^ 2025-05-07T20:32:01.2917834Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2917838Z 2025-05-07T20:32:01.2918245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2918249Z 2025-05-07T20:32:01.2918349Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2918565Z self=, 2025-05-07T20:32:01.2918643Z T=128, 2025-05-07T20:32:01.2918717Z D=5120, 2025-05-07T20:32:01.2918796Z scale_ub=None, 2025-05-07T20:32:01.2918886Z contiguous=False, 2025-05-07T20:32:01.2918968Z compiled=False, 2025-05-07T20:32:01.2919116Z ) 2025-05-07T20:32:01.2919329Z self = 2025-05-07T20:32:01.2919495Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.2919505Z 2025-05-07T20:32:01.2919587Z @given( 2025-05-07T20:32:01.2919707Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2919801Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2919919Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2920032Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2920141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2920220Z ) 2025-05-07T20:32:01.2920459Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2920555Z def test_silu_mul_quant( 2025-05-07T20:32:01.2920633Z self, 2025-05-07T20:32:01.2920705Z T: int, 2025-05-07T20:32:01.2920791Z D: int, 2025-05-07T20:32:01.2920885Z scale_ub: Optional[float], 2025-05-07T20:32:01.2920980Z contiguous: bool, 2025-05-07T20:32:01.2925857Z compiled: bool, 2025-05-07T20:32:01.2925955Z ) -> None: 2025-05-07T20:32:01.2926064Z torch.manual_seed(2025) 2025-05-07T20:32:01.2926147Z 2025-05-07T20:32:01.2926319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2926397Z 2025-05-07T20:32:01.2926499Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2926624Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2926715Z x = x_sign * x_clamp 2025-05-07T20:32:01.2926805Z x0 = x[:, :D] 2025-05-07T20:32:01.2926888Z x1 = x[:, D:] 2025-05-07T20:32:01.2926968Z 2025-05-07T20:32:01.2927055Z if contiguous: 2025-05-07T20:32:01.2927148Z x0 = x0.contiguous() 2025-05-07T20:32:01.2927244Z x1 = x1.contiguous() 2025-05-07T20:32:01.2927322Z 2025-05-07T20:32:01.2927419Z if scale_ub is not None: 2025-05-07T20:32:01.2927533Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.2927669Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2927753Z ) 2025-05-07T20:32:01.2927841Z else: 2025-05-07T20:32:01.2927942Z scale_ub_tensor = None 2025-05-07T20:32:01.2928016Z 2025-05-07T20:32:01.2928158Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2928253Z op = silu_mul_quant 2025-05-07T20:32:01.2928347Z if compiled: 2025-05-07T20:32:01.2928449Z op = torch.compile(op) 2025-05-07T20:32:01.2928557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2928637Z 2025-05-07T20:32:01.2928730Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.2928734Z 2025-05-07T20:32:01.2928833Z moe/activation_test.py:117: 2025-05-07T20:32:01.2928975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2929081Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.2929184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2929691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.2929935Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.2930297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.2930520Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2930857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.2930964Z kernel = self.compile( 2025-05-07T20:32:01.2931343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.2931595Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.2931733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2931737Z 2025-05-07T20:32:01.2931944Z self = 2025-05-07T20:32:01.2932727Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.2933227Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4a4c900>} 2025-05-07T20:32:01.2934091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.2934287Z context = 2025-05-07T20:32:01.2934291Z 2025-05-07T20:32:01.2934455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.2934721Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.2934837Z module_map=module_map) 2025-05-07T20:32:01.2935009Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.2935110Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.2935188Z E ^ 2025-05-07T20:32:01.2935547Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2935551Z 2025-05-07T20:32:01.2935960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2935964Z 2025-05-07T20:32:01.2936076Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2936301Z self=, 2025-05-07T20:32:01.2936380Z T=128, 2025-05-07T20:32:01.2936467Z D=5120, 2025-05-07T20:32:01.2936553Z scale_ub=1200.0, 2025-05-07T20:32:01.2936643Z contiguous=True, 2025-05-07T20:32:01.2936736Z compiled=False, 2025-05-07T20:32:01.2936811Z ) 2025-05-07T20:32:01.2937029Z self = 2025-05-07T20:32:01.2937205Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.2937209Z 2025-05-07T20:32:01.2937288Z @given( 2025-05-07T20:32:01.2937409Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2937521Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2937637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2937764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2937884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2937960Z ) 2025-05-07T20:32:01.2938213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2938309Z def test_silu_mul_quant( 2025-05-07T20:32:01.2938475Z self, 2025-05-07T20:32:01.2938563Z T: int, 2025-05-07T20:32:01.2938641Z D: int, 2025-05-07T20:32:01.2938741Z scale_ub: Optional[float], 2025-05-07T20:32:01.2938837Z contiguous: bool, 2025-05-07T20:32:01.2938924Z compiled: bool, 2025-05-07T20:32:01.2939010Z ) -> None: 2025-05-07T20:32:01.2939107Z torch.manual_seed(2025) 2025-05-07T20:32:01.2939182Z 2025-05-07T20:32:01.2939355Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2939435Z 2025-05-07T20:32:01.2939527Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2939651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2939746Z x = x_sign * x_clamp 2025-05-07T20:32:01.2939904Z x0 = x[:, :D] 2025-05-07T20:32:01.2939988Z x1 = x[:, D:] 2025-05-07T20:32:01.2940068Z 2025-05-07T20:32:01.2940153Z if contiguous: 2025-05-07T20:32:01.2940246Z x0 = x0.contiguous() 2025-05-07T20:32:01.2940348Z x1 = x1.contiguous() 2025-05-07T20:32:01.2940422Z 2025-05-07T20:32:01.2940513Z if scale_ub is not None: 2025-05-07T20:32:01.2940630Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.2940766Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2940843Z ) 2025-05-07T20:32:01.2940927Z else: 2025-05-07T20:32:01.2941022Z scale_ub_tensor = None 2025-05-07T20:32:01.2941095Z 2025-05-07T20:32:01.2941232Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2941323Z op = silu_mul_quant 2025-05-07T20:32:01.2941416Z if compiled: 2025-05-07T20:32:01.2941517Z op = torch.compile(op) 2025-05-07T20:32:01.2941631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2941712Z 2025-05-07T20:32:01.2941805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.2941809Z 2025-05-07T20:32:01.2941911Z moe/activation_test.py:117: 2025-05-07T20:32:01.2942052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2942154Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.2942253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2942749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.2942848Z 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f93a4c39ee0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <triton._C.libtriton.ir.context object at 0x...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
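Every frame above points at the same root cause: Triton cannot lower the fp8e4nv (FP8 E4M3) dtype on this GPU, and per the ValueError the architecture only offers fp8e4b15 and fp8e5, which matches a pre-Ada part such as an A10G (sm_86) rather than an sm_89+ card. A guard along the following lines (hypothetical, not part of moe/activation_test.py; the (8, 9) threshold is an assumption about where fp8e4nv support begins) would let the suite skip rather than error on such runners:

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv lowering requires compute capability >= 8.9;
        # an sm_86 part only gets fp8e4b15/fp8e5, exactly as the error reports.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch, on the test method or class:
    # @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv not supported on this GPU")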
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test body, traceback, and error identical to the first example above:
 CompilationError in _fbgemm_silu_mul_quant, ValueError("type fp8e4nv not supported
 in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
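Judging from the test's reference function (shown in the next example below), silu_mul_quant fuses SiLU(x0) * x1 with rowwise FP8 quantization, so the unquantized part reduces to a few lines of eager PyTorch. A minimal sketch of that activation, with a hypothetical name and under the assumption that it mirrors ref_fn exactly:

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1, upcast to fp32 the same way the test's ref_fn does.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32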
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [test setup identical to the first example above]

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
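This example is the one variant in the batch: fn() itself returned, and the failure moved into the test's reference path, where triton_quantize_fp8_row compiles its own kernel, _kernel_quantize_fp8_row, and trips over the identical fp8e4nv restriction. The test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so y_scale is a per-row dequantization scale; a plain-PyTorch rowwise quantizer consistent with that contract might look as follows (a sketch built on that assumption, not fbgemm_gpu's actual implementation):

    import torch

    # Assumed target dtype; finfo gives its representable max (448.0 for e4m3fn).
    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # Per-row max magnitude, optionally clamped by the scale upper bound.
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Dequant scale chosen so the largest row entry maps to FP8_MAX.
        y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale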
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2991015Z 2025-05-07T20:32:01.2991429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2991433Z 2025-05-07T20:32:01.2991546Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2991774Z self=, 2025-05-07T20:32:01.2991854Z T=1, 2025-05-07T20:32:01.2991941Z D=5120, 2025-05-07T20:32:01.2992026Z scale_ub=1200.0, 2025-05-07T20:32:01.2992113Z contiguous=False, 2025-05-07T20:32:01.2992204Z compiled=True, 2025-05-07T20:32:01.2992278Z ) 2025-05-07T20:32:01.2992492Z self = 2025-05-07T20:32:01.2992662Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.2992666Z 2025-05-07T20:32:01.2992745Z @given( 2025-05-07T20:32:01.2992865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2992976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2993092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2993214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2993328Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2993414Z ) 2025-05-07T20:32:01.2993662Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2993758Z def test_silu_mul_quant( 2025-05-07T20:32:01.2993836Z self, 2025-05-07T20:32:01.2993921Z T: int, 2025-05-07T20:32:01.2993999Z D: int, 2025-05-07T20:32:01.2994099Z scale_ub: Optional[float], 2025-05-07T20:32:01.2994195Z contiguous: bool, 2025-05-07T20:32:01.2994282Z compiled: bool, 2025-05-07T20:32:01.2994370Z ) -> None: 2025-05-07T20:32:01.2994466Z torch.manual_seed(2025) 2025-05-07T20:32:01.2994542Z 2025-05-07T20:32:01.2994714Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2994793Z 2025-05-07T20:32:01.2994888Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2995018Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2995106Z x = x_sign * x_clamp 2025-05-07T20:32:01.2995271Z x0 = x[:, :D] 2025-05-07T20:32:01.2995357Z x1 = x[:, D:] 2025-05-07T20:32:01.2995432Z 2025-05-07T20:32:01.2995516Z if contiguous: 2025-05-07T20:32:01.2995616Z x0 = x0.contiguous() 2025-05-07T20:32:01.2995709Z x1 = x1.contiguous() 2025-05-07T20:32:01.2995781Z 2025-05-07T20:32:01.2995879Z if scale_ub is not None: 2025-05-07T20:32:01.2995986Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.2996125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2996202Z ) 2025-05-07T20:32:01.2996280Z else: 2025-05-07T20:32:01.2996379Z scale_ub_tensor = None 2025-05-07T20:32:01.2996451Z 2025-05-07T20:32:01.2996665Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2996761Z op = silu_mul_quant 2025-05-07T20:32:01.2996845Z if compiled: 2025-05-07T20:32:01.2996946Z op = torch.compile(op) 2025-05-07T20:32:01.2997063Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2997135Z 2025-05-07T20:32:01.2997226Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.2997235Z 2025-05-07T20:32:01.2997330Z moe/activation_test.py:117: 2025-05-07T20:32:01.2997459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2997565Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.2997663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2998024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.2998124Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.2998953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.2999058Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.2999421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.2999646Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2999990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3000083Z kernel = self.compile( 2025-05-07T20:32:01.3000462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3000640Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3000770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3000775Z 2025-05-07T20:32:01.3000983Z self = 2025-05-07T20:32:01.3001744Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3002243Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4867920>} 2025-05-07T20:32:01.3002979Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3003168Z context = 2025-05-07T20:32:01.3003173Z 2025-05-07T20:32:01.3003342Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3003603Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3003712Z module_map=module_map) 2025-05-07T20:32:01.3003881Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3004139Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3004220Z E ^ 2025-05-07T20:32:01.3004567Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3004571Z 2025-05-07T20:32:01.3004977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3004982Z 2025-05-07T20:32:01.3005091Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3005333Z self=, 2025-05-07T20:32:01.3005418Z T=1, 2025-05-07T20:32:01.3005510Z D=5120, 2025-05-07T20:32:01.3005597Z scale_ub=1200.0, 2025-05-07T20:32:01.3005795Z contiguous=False, 2025-05-07T20:32:01.3005881Z compiled=False, 2025-05-07T20:32:01.3005953Z ) 2025-05-07T20:32:01.3006173Z self = 2025-05-07T20:32:01.3006342Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3006347Z 2025-05-07T20:32:01.3006422Z @given( 2025-05-07T20:32:01.3006549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3006645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3006756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3006876Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3006988Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3007070Z ) 2025-05-07T20:32:01.3007310Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3007403Z def test_silu_mul_quant( 2025-05-07T20:32:01.3007486Z self, 2025-05-07T20:32:01.3007563Z T: int, 2025-05-07T20:32:01.3007639Z D: int, 2025-05-07T20:32:01.3007741Z scale_ub: Optional[float], 2025-05-07T20:32:01.3007830Z contiguous: bool, 2025-05-07T20:32:01.3007918Z compiled: bool, 2025-05-07T20:32:01.3007999Z ) -> None: 2025-05-07T20:32:01.3008093Z torch.manual_seed(2025) 2025-05-07T20:32:01.3008162Z 2025-05-07T20:32:01.3008334Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3008406Z 2025-05-07T20:32:01.3008502Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3008625Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3008713Z x = x_sign * x_clamp 2025-05-07T20:32:01.3008798Z x0 = x[:, :D] 2025-05-07T20:32:01.3008876Z x1 = x[:, D:] 2025-05-07T20:32:01.3008949Z 2025-05-07T20:32:01.3009037Z if contiguous: 2025-05-07T20:32:01.3009127Z x0 = x0.contiguous() 2025-05-07T20:32:01.3009221Z x1 = x1.contiguous() 2025-05-07T20:32:01.3009298Z 2025-05-07T20:32:01.3009387Z if scale_ub is not None: 2025-05-07T20:32:01.3009493Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3009635Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3009711Z ) 2025-05-07T20:32:01.3009791Z else: 2025-05-07T20:32:01.3009886Z scale_ub_tensor = None 2025-05-07T20:32:01.3009958Z 2025-05-07T20:32:01.3010090Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3010179Z op = silu_mul_quant 2025-05-07T20:32:01.3010261Z if compiled: 2025-05-07T20:32:01.3010364Z op = torch.compile(op) 2025-05-07T20:32:01.3010471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3010543Z 2025-05-07T20:32:01.3010640Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3010644Z 2025-05-07T20:32:01.3010739Z moe/activation_test.py:117: 2025-05-07T20:32:01.3010875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3010973Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3011070Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3011652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3011748Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3012101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3012324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3012658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3012752Z kernel = self.compile( 2025-05-07T20:32:01.3013204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3013378Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3013508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3013518Z 2025-05-07T20:32:01.3013788Z self = 2025-05-07T20:32:01.3014552Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3015048Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4d72b60>} 2025-05-07T20:32:01.3015832Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3016024Z context = 2025-05-07T20:32:01.3016029Z 2025-05-07T20:32:01.3016193Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3016461Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3016569Z module_map=module_map) 2025-05-07T20:32:01.3016731Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3016832Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3016906Z E ^ 2025-05-07T20:32:01.3017254Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3017263Z 2025-05-07T20:32:01.3017674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3017679Z 2025-05-07T20:32:01.3017781Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3018005Z self=, 2025-05-07T20:32:01.3018089Z T=16384, 2025-05-07T20:32:01.3018167Z D=5120, 2025-05-07T20:32:01.3018254Z scale_ub=1200.0, 2025-05-07T20:32:01.3018339Z contiguous=False, 2025-05-07T20:32:01.3018420Z compiled=True, 2025-05-07T20:32:01.3018498Z ) 2025-05-07T20:32:01.3018714Z self = 2025-05-07T20:32:01.3018890Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.3018895Z 2025-05-07T20:32:01.3018972Z @given( 2025-05-07T20:32:01.3019091Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3019193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3019305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3019426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3019540Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3019614Z ) 2025-05-07T20:32:01.3019855Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3020067Z def test_silu_mul_quant( 2025-05-07T20:32:01.3020143Z self, 2025-05-07T20:32:01.3020224Z T: int, 2025-05-07T20:32:01.3020302Z D: int, 2025-05-07T20:32:01.3020400Z scale_ub: Optional[float], 2025-05-07T20:32:01.3020491Z contiguous: bool, 2025-05-07T20:32:01.3020576Z compiled: bool, 2025-05-07T20:32:01.3020654Z ) -> None: 2025-05-07T20:32:01.3020752Z torch.manual_seed(2025) 2025-05-07T20:32:01.3020824Z 2025-05-07T20:32:01.3020989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3021065Z 2025-05-07T20:32:01.3021155Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3021353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3021447Z x = x_sign * x_clamp 2025-05-07T20:32:01.3021525Z x0 = x[:, :D] 2025-05-07T20:32:01.3021605Z x1 = x[:, D:] 2025-05-07T20:32:01.3021678Z 2025-05-07T20:32:01.3021766Z if contiguous: 2025-05-07T20:32:01.3021863Z x0 = x0.contiguous() 2025-05-07T20:32:01.3021952Z x1 = x1.contiguous() 2025-05-07T20:32:01.3022024Z 2025-05-07T20:32:01.3022121Z if scale_ub is not None: 2025-05-07T20:32:01.3022225Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3022358Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3022438Z ) 2025-05-07T20:32:01.3022514Z else: 2025-05-07T20:32:01.3022606Z scale_ub_tensor = None 2025-05-07T20:32:01.3022685Z 2025-05-07T20:32:01.3022812Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3022905Z op = silu_mul_quant 2025-05-07T20:32:01.3022996Z if compiled: 2025-05-07T20:32:01.3023094Z op = torch.compile(op) 2025-05-07T20:32:01.3023204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3023276Z 2025-05-07T20:32:01.3023366Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3023376Z 2025-05-07T20:32:01.3023473Z moe/activation_test.py:117: 2025-05-07T20:32:01.3023601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3023698Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3023802Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3024162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3024258Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3024747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3024842Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3025209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3025427Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3025766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3025862Z kernel = self.compile( 2025-05-07T20:32:01.3026241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3026416Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3026544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3026549Z 2025-05-07T20:32:01.3026750Z self = 2025-05-07T20:32:01.3027521Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3028017Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4d72d40>} 2025-05-07T20:32:01.3028833Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3029019Z context = 2025-05-07T20:32:01.3029023Z 2025-05-07T20:32:01.3029188Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3029448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3029627Z module_map=module_map) 2025-05-07T20:32:01.3029793Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3029893Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3029969Z E ^ 2025-05-07T20:32:01.3030326Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3030330Z 2025-05-07T20:32:01.3030736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3030740Z 2025-05-07T20:32:01.3030845Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3031063Z self=, 2025-05-07T20:32:01.3031140Z T=2048, 2025-05-07T20:32:01.3031222Z D=7168, 2025-05-07T20:32:01.3031306Z scale_ub=1200.0, 2025-05-07T20:32:01.3031392Z contiguous=False, 2025-05-07T20:32:01.3031479Z compiled=True, 2025-05-07T20:32:01.3031554Z ) 2025-05-07T20:32:01.3031773Z self = 2025-05-07T20:32:01.3031949Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.3031957Z 2025-05-07T20:32:01.3032034Z @given( 2025-05-07T20:32:01.3032158Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3032258Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3032375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3032496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3032609Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3032686Z ) 2025-05-07T20:32:01.3032933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3033026Z def test_silu_mul_quant( 2025-05-07T20:32:01.3033103Z self, 2025-05-07T20:32:01.3033185Z T: int, 2025-05-07T20:32:01.3033264Z D: int, 2025-05-07T20:32:01.3033371Z scale_ub: Optional[float], 2025-05-07T20:32:01.3033461Z contiguous: bool, 2025-05-07T20:32:01.3033546Z compiled: bool, 2025-05-07T20:32:01.3033627Z ) -> None: 2025-05-07T20:32:01.3033720Z torch.manual_seed(2025) 2025-05-07T20:32:01.3033797Z 2025-05-07T20:32:01.3033970Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3034043Z 2025-05-07T20:32:01.3034133Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3034260Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3034347Z x = x_sign * x_clamp 2025-05-07T20:32:01.3034427Z x0 = x[:, :D] 2025-05-07T20:32:01.3034510Z x1 = x[:, D:] 2025-05-07T20:32:01.3034580Z 2025-05-07T20:32:01.3034666Z if contiguous: 2025-05-07T20:32:01.3034755Z x0 = x0.contiguous() 2025-05-07T20:32:01.3034842Z x1 = x1.contiguous() 2025-05-07T20:32:01.3034919Z 2025-05-07T20:32:01.3035014Z if scale_ub is not None: 2025-05-07T20:32:01.3035118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3035258Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3035332Z ) 2025-05-07T20:32:01.3035494Z else: 2025-05-07T20:32:01.3035592Z scale_ub_tensor = None 2025-05-07T20:32:01.3035664Z 2025-05-07T20:32:01.3035792Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3035881Z op = silu_mul_quant 2025-05-07T20:32:01.3035965Z if compiled: 2025-05-07T20:32:01.3036063Z op = torch.compile(op) 2025-05-07T20:32:01.3036170Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3036240Z 2025-05-07T20:32:01.3036335Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3036339Z 2025-05-07T20:32:01.3036435Z moe/activation_test.py:117: 2025-05-07T20:32:01.3036565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3036749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3036849Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3037212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3037315Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3037801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3037901Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3038255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3038474Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3038814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3038905Z kernel = self.compile( 2025-05-07T20:32:01.3039286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3039462Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3039595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3039599Z 2025-05-07T20:32:01.3039804Z self = 2025-05-07T20:32:01.3040562Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3041058Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93cd947e20>} 2025-05-07T20:32:01.3041794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3041990Z context = 2025-05-07T20:32:01.3041998Z 2025-05-07T20:32:01.3042168Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3042426Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3042537Z module_map=module_map) 2025-05-07T20:32:01.3042697Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3042795Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3042878Z E ^ 2025-05-07T20:32:01.3043222Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3043227Z 2025-05-07T20:32:01.3043635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3043645Z 2025-05-07T20:32:01.3043747Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3043967Z self=, 2025-05-07T20:32:01.3044131Z T=1, 2025-05-07T20:32:01.3044207Z D=5120, 2025-05-07T20:32:01.3044290Z scale_ub=None, 2025-05-07T20:32:01.3044377Z contiguous=False, 2025-05-07T20:32:01.3044462Z compiled=False, 2025-05-07T20:32:01.3044538Z ) 2025-05-07T20:32:01.3044757Z self = 2025-05-07T20:32:01.3044920Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3044924Z 2025-05-07T20:32:01.3045004Z @given( 2025-05-07T20:32:01.3045122Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3045221Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3045436Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3045573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3045690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3045767Z ) 2025-05-07T20:32:01.3046013Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3046103Z def test_silu_mul_quant( 2025-05-07T20:32:01.3046182Z self, 2025-05-07T20:32:01.3046258Z T: int, 2025-05-07T20:32:01.3050345Z D: int, 2025-05-07T20:32:01.3050448Z scale_ub: Optional[float], 2025-05-07T20:32:01.3050546Z contiguous: bool, 2025-05-07T20:32:01.3050631Z compiled: bool, 2025-05-07T20:32:01.3050711Z ) -> None: 2025-05-07T20:32:01.3050810Z torch.manual_seed(2025) 2025-05-07T20:32:01.3050881Z 2025-05-07T20:32:01.3051050Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3051124Z 2025-05-07T20:32:01.3051221Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3051344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3051435Z x = x_sign * x_clamp 2025-05-07T20:32:01.3051514Z x0 = x[:, :D] 2025-05-07T20:32:01.3051594Z x1 = x[:, D:] 2025-05-07T20:32:01.3051677Z 2025-05-07T20:32:01.3051760Z if contiguous: 2025-05-07T20:32:01.3051855Z x0 = x0.contiguous() 2025-05-07T20:32:01.3051942Z x1 = x1.contiguous() 2025-05-07T20:32:01.3052013Z 2025-05-07T20:32:01.3052109Z if scale_ub is not None: 2025-05-07T20:32:01.3052215Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3052347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3052425Z ) 2025-05-07T20:32:01.3052504Z else: 2025-05-07T20:32:01.3052597Z scale_ub_tensor = None 2025-05-07T20:32:01.3052671Z 2025-05-07T20:32:01.3052800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3052897Z op = silu_mul_quant 2025-05-07T20:32:01.3052983Z if compiled: 2025-05-07T20:32:01.3053083Z op = torch.compile(op) 2025-05-07T20:32:01.3053189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3053265Z 2025-05-07T20:32:01.3053353Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3053358Z 2025-05-07T20:32:01.3053457Z moe/activation_test.py:117: 2025-05-07T20:32:01.3053586Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3053751Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3053857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3054352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3054447Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3054804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3055027Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3055368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3055573Z kernel = self.compile( 2025-05-07T20:32:01.3055951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3056126Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3056253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3056258Z 2025-05-07T20:32:01.3056463Z self = 2025-05-07T20:32:01.3057300Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3057797Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a55eb100>} 2025-05-07T20:32:01.3058539Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3058725Z context = 2025-05-07T20:32:01.3058730Z 2025-05-07T20:32:01.3058898Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3059156Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3059265Z module_map=module_map) 2025-05-07T20:32:01.3059429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3059532Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3059613Z E ^ 2025-05-07T20:32:01.3059963Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3059973Z 2025-05-07T20:32:01.3060379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3060384Z 2025-05-07T20:32:01.3060490Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3060708Z self=, 2025-05-07T20:32:01.3060793Z T=4096, 2025-05-07T20:32:01.3060870Z D=7168, 2025-05-07T20:32:01.3060954Z scale_ub=1200.0, 2025-05-07T20:32:01.3061046Z contiguous=False, 2025-05-07T20:32:01.3061129Z compiled=False, 2025-05-07T20:32:01.3061203Z ) 2025-05-07T20:32:01.3061425Z self = 2025-05-07T20:32:01.3061603Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3061607Z 2025-05-07T20:32:01.3061683Z @given( 2025-05-07T20:32:01.3061804Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3061909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3062023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3062144Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3062257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3062337Z ) 2025-05-07T20:32:01.3062581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3062676Z def test_silu_mul_quant( 2025-05-07T20:32:01.3062759Z self, 2025-05-07T20:32:01.3062835Z T: int, 2025-05-07T20:32:01.3062913Z D: int, 2025-05-07T20:32:01.3063020Z scale_ub: Optional[float], 2025-05-07T20:32:01.3063109Z contiguous: bool, 2025-05-07T20:32:01.3063202Z compiled: bool, 2025-05-07T20:32:01.3063284Z ) -> None: 2025-05-07T20:32:01.3063383Z torch.manual_seed(2025) 2025-05-07T20:32:01.3063457Z 2025-05-07T20:32:01.3063624Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3063787Z 2025-05-07T20:32:01.3063879Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3064000Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3064092Z x = x_sign * x_clamp 2025-05-07T20:32:01.3064173Z x0 = x[:, :D] 2025-05-07T20:32:01.3064252Z x1 = x[:, D:] 2025-05-07T20:32:01.3064326Z 2025-05-07T20:32:01.3064408Z if contiguous: 2025-05-07T20:32:01.3064506Z x0 = x0.contiguous() 2025-05-07T20:32:01.3064594Z x1 = x1.contiguous() 2025-05-07T20:32:01.3064668Z 2025-05-07T20:32:01.3064758Z if scale_ub is not None: 2025-05-07T20:32:01.3064862Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3065070Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3065152Z ) 2025-05-07T20:32:01.3065231Z else: 2025-05-07T20:32:01.3065326Z scale_ub_tensor = None 2025-05-07T20:32:01.3065401Z 2025-05-07T20:32:01.3065534Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3065621Z op = silu_mul_quant 2025-05-07T20:32:01.3065709Z if compiled: 2025-05-07T20:32:01.3065806Z op = torch.compile(op) 2025-05-07T20:32:01.3065913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3065983Z 2025-05-07T20:32:01.3066073Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3066077Z 2025-05-07T20:32:01.3066178Z moe/activation_test.py:117: 2025-05-07T20:32:01.3066305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3066403Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3066503Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3066995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.3067094Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3067457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3067675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3068012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3068105Z kernel = self.compile( 2025-05-07T20:32:01.3068481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3068656Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3068779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3068791Z 2025-05-07T20:32:01.3068995Z self = 2025-05-07T20:32:01.3069755Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3070254Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397cdc2c0>} 2025-05-07T20:32:01.3070991Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3071177Z context = 2025-05-07T20:32:01.3071182Z 2025-05-07T20:32:01.3071355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3071610Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3071716Z module_map=module_map) 2025-05-07T20:32:01.3071963Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3072063Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3072144Z E ^ 2025-05-07T20:32:01.3072492Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3072497Z 2025-05-07T20:32:01.3072902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3072906Z 2025-05-07T20:32:01.3073012Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3073231Z self=, 2025-05-07T20:32:01.3073311Z T=16384, 2025-05-07T20:32:01.3073465Z D=7168, 2025-05-07T20:32:01.3073551Z scale_ub=None, 2025-05-07T20:32:01.3073639Z contiguous=True, 2025-05-07T20:32:01.3073723Z compiled=True, 2025-05-07T20:32:01.3073797Z ) 2025-05-07T20:32:01.3074021Z self = 2025-05-07T20:32:01.3074192Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.3074196Z 2025-05-07T20:32:01.3074272Z @given( 2025-05-07T20:32:01.3074393Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3074493Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3074611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3074727Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3074841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3074919Z ) 2025-05-07T20:32:01.3075164Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3075258Z def test_silu_mul_quant( 2025-05-07T20:32:01.3075338Z self, 2025-05-07T20:32:01.3075417Z T: int, 2025-05-07T20:32:01.3075495Z D: int, 2025-05-07T20:32:01.3075595Z scale_ub: Optional[float], 2025-05-07T20:32:01.3075690Z contiguous: bool, 2025-05-07T20:32:01.3075774Z compiled: bool, 2025-05-07T20:32:01.3075855Z ) -> None: 2025-05-07T20:32:01.3075949Z torch.manual_seed(2025) 2025-05-07T20:32:01.3076024Z 2025-05-07T20:32:01.3076188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3076262Z 2025-05-07T20:32:01.3076356Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3076483Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3076571Z x = x_sign * x_clamp 2025-05-07T20:32:01.3076658Z x0 = x[:, :D] 2025-05-07T20:32:01.3076739Z x1 = x[:, D:] 2025-05-07T20:32:01.3076811Z 2025-05-07T20:32:01.3076903Z if contiguous: 2025-05-07T20:32:01.3076994Z x0 = x0.contiguous() 2025-05-07T20:32:01.3077084Z x1 = x1.contiguous() 2025-05-07T20:32:01.3077160Z 2025-05-07T20:32:01.3077251Z if scale_ub is not None: 2025-05-07T20:32:01.3077360Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3077497Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3077572Z ) 2025-05-07T20:32:01.3077651Z else: 2025-05-07T20:32:01.3077743Z scale_ub_tensor = None 2025-05-07T20:32:01.3077814Z 2025-05-07T20:32:01.3077946Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3078035Z op = silu_mul_quant 2025-05-07T20:32:01.3078119Z if compiled: 2025-05-07T20:32:01.3078220Z op = torch.compile(op) 2025-05-07T20:32:01.3078323Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3078394Z 2025-05-07T20:32:01.3078486Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3078495Z 2025-05-07T20:32:01.3078590Z moe/activation_test.py:117: 2025-05-07T20:32:01.3078720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3078819Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3079029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3079397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3079488Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3079970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3080069Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3080423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3080644Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3081051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3081145Z kernel = self.compile( 2025-05-07T20:32:01.3081526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3081703Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3081828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3081836Z 2025-05-07T20:32:01.3082037Z self = 2025-05-07T20:32:01.3082798Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3083301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397cddc60>} 2025-05-07T20:32:01.3084033Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3084228Z context = 2025-05-07T20:32:01.3084232Z 2025-05-07T20:32:01.3084393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3084647Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3084755Z module_map=module_map) 2025-05-07T20:32:01.3084914Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3085012Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3085095Z E ^ 2025-05-07T20:32:01.3085445Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:01.3085964Z Trying example: test_silu_mul_quant(
    self=<repr lost in log capture>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <repr lost in log capture>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <repr lost in log capture>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function; repr lost in log capture>, 'min_dot_size': <function; repr lost in log capture>}
module_map = {'triton.language.extra.libdevice': <module; repr lost in log capture>}
context = <repr lost in log capture>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
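For readers without the source handy: judging only from the call site above (two bfloat16 halves x0 and x1, an optional float32 scale_ub tensor, and a (y_fp8, y_scale) return), silu_mul_quant fuses a SwiGLU-style gated activation, silu(x0) * x1, with FP8 quantization. The eager-mode sketch below captures that contract; it assumes row-wise E4M3 scaling and is inferred from the test, not taken from the FBGEMM kernel:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Gated activation computed in fp32 for accuracy: silu(x0) * x1.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Per-row absmax, optionally capped by the caller-supplied upper bound.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale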
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3098834Z 2025-05-07T20:32:01.3099238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3099242Z 2025-05-07T20:32:01.3099344Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3099569Z self=, 2025-05-07T20:32:01.3099643Z T=4096, 2025-05-07T20:32:01.3099716Z D=5120, 2025-05-07T20:32:01.3099800Z scale_ub=1200.0, 2025-05-07T20:32:01.3099884Z contiguous=False, 2025-05-07T20:32:01.3099966Z compiled=False, 2025-05-07T20:32:01.3100043Z ) 2025-05-07T20:32:01.3100253Z self = 2025-05-07T20:32:01.3100427Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3100431Z 2025-05-07T20:32:01.3100504Z @given( 2025-05-07T20:32:01.3100620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3100719Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3100829Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3100942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3101055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3101126Z ) 2025-05-07T20:32:01.3101372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3101463Z def test_silu_mul_quant( 2025-05-07T20:32:01.3101537Z self, 2025-05-07T20:32:01.3101615Z T: int, 2025-05-07T20:32:01.3101693Z D: int, 2025-05-07T20:32:01.3101787Z scale_ub: Optional[float], 2025-05-07T20:32:01.3101878Z contiguous: bool, 2025-05-07T20:32:01.3101960Z compiled: bool, 2025-05-07T20:32:01.3102036Z ) -> None: 2025-05-07T20:32:01.3102133Z torch.manual_seed(2025) 2025-05-07T20:32:01.3102202Z 2025-05-07T20:32:01.3102363Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3102440Z 2025-05-07T20:32:01.3102528Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3102654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3102739Z x = x_sign * x_clamp 2025-05-07T20:32:01.3102817Z x0 = x[:, :D] 2025-05-07T20:32:01.3102896Z x1 = x[:, D:] 2025-05-07T20:32:01.3102971Z 2025-05-07T20:32:01.3103052Z if contiguous: 2025-05-07T20:32:01.3103145Z x0 = x0.contiguous() 2025-05-07T20:32:01.3103231Z x1 = x1.contiguous() 2025-05-07T20:32:01.3103449Z 2025-05-07T20:32:01.3103540Z if scale_ub is not None: 2025-05-07T20:32:01.3103642Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3103774Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3103851Z ) 2025-05-07T20:32:01.3103925Z else: 2025-05-07T20:32:01.3104016Z scale_ub_tensor = None 2025-05-07T20:32:01.3104092Z 2025-05-07T20:32:01.3104215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3104307Z op = silu_mul_quant 2025-05-07T20:32:01.3104389Z if compiled: 2025-05-07T20:32:01.3104485Z op = torch.compile(op) 2025-05-07T20:32:01.3104589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3104768Z 2025-05-07T20:32:01.3104858Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3104862Z 2025-05-07T20:32:01.3104959Z moe/activation_test.py:117: 2025-05-07T20:32:01.3105087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3105191Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3105292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3105832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.3105934Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3106288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3106509Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3106848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3106941Z kernel = self.compile( 2025-05-07T20:32:01.3107322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3107490Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3107624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3107629Z 2025-05-07T20:32:01.3107827Z self = 2025-05-07T20:32:01.3108589Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3109078Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397cdfba0>} 2025-05-07T20:32:01.3109817Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3110006Z context = 2025-05-07T20:32:01.3110010Z 2025-05-07T20:32:01.3110174Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3110435Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3110540Z module_map=module_map) 2025-05-07T20:32:01.3110696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3110797Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3110872Z E ^ 2025-05-07T20:32:01.3111223Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3111232Z 2025-05-07T20:32:01.3111636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3111640Z 2025-05-07T20:32:01.3111740Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3112075Z self=, 2025-05-07T20:32:01.3112151Z T=4096, 2025-05-07T20:32:01.3112235Z D=5120, 2025-05-07T20:32:01.3112317Z scale_ub=1200.0, 2025-05-07T20:32:01.3112403Z contiguous=False, 2025-05-07T20:32:01.3112490Z compiled=True, 2025-05-07T20:32:01.3112562Z ) 2025-05-07T20:32:01.3112773Z self = 2025-05-07T20:32:01.3112948Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.3112952Z 2025-05-07T20:32:01.3113026Z @given( 2025-05-07T20:32:01.3113141Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3113315Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3113428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3113544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3113654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3113731Z ) 2025-05-07T20:32:01.3113973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3114063Z def test_silu_mul_quant( 2025-05-07T20:32:01.3114138Z self, 2025-05-07T20:32:01.3114217Z T: int, 2025-05-07T20:32:01.3114291Z D: int, 2025-05-07T20:32:01.3114386Z scale_ub: Optional[float], 2025-05-07T20:32:01.3114477Z contiguous: bool, 2025-05-07T20:32:01.3114560Z compiled: bool, 2025-05-07T20:32:01.3114636Z ) -> None: 2025-05-07T20:32:01.3114730Z torch.manual_seed(2025) 2025-05-07T20:32:01.3114800Z 2025-05-07T20:32:01.3114972Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3115043Z 2025-05-07T20:32:01.3115134Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3115257Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3115343Z x = x_sign * x_clamp 2025-05-07T20:32:01.3115425Z x0 = x[:, :D] 2025-05-07T20:32:01.3115508Z x1 = x[:, D:] 2025-05-07T20:32:01.3115576Z 2025-05-07T20:32:01.3115657Z if contiguous: 2025-05-07T20:32:01.3115751Z x0 = x0.contiguous() 2025-05-07T20:32:01.3115836Z x1 = x1.contiguous() 2025-05-07T20:32:01.3115906Z 2025-05-07T20:32:01.3115997Z if scale_ub is not None: 2025-05-07T20:32:01.3116099Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3116229Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3116307Z ) 2025-05-07T20:32:01.3116380Z else: 2025-05-07T20:32:01.3116476Z scale_ub_tensor = None 2025-05-07T20:32:01.3116547Z 2025-05-07T20:32:01.3116678Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3116771Z op = silu_mul_quant 2025-05-07T20:32:01.3116852Z if compiled: 2025-05-07T20:32:01.3116948Z op = torch.compile(op) 2025-05-07T20:32:01.3117056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3117129Z 2025-05-07T20:32:01.3117217Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3117221Z 2025-05-07T20:32:01.3117317Z moe/activation_test.py:117: 2025-05-07T20:32:01.3117443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3117543Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3117639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3117998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3118093Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3118580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3118673Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3119026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3119329Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3119665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3119755Z kernel = self.compile( 2025-05-07T20:32:01.3120132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3120305Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3120428Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3120433Z 2025-05-07T20:32:01.3120706Z self = 2025-05-07T20:32:01.3121465Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3121967Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397d94ea0>} 2025-05-07T20:32:01.3122698Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3122881Z context = 2025-05-07T20:32:01.3122885Z 2025-05-07T20:32:01.3123048Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3123307Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3123411Z module_map=module_map) 2025-05-07T20:32:01.3123573Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3123674Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3123748Z E ^ 2025-05-07T20:32:01.3124095Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3124100Z 2025-05-07T20:32:01.3124506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3124510Z 2025-05-07T20:32:01.3124617Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3124835Z self=, 2025-05-07T20:32:01.3124907Z T=2048, 2025-05-07T20:32:01.3124984Z D=7168, 2025-05-07T20:32:01.3125068Z scale_ub=1200.0, 2025-05-07T20:32:01.3125154Z contiguous=False, 2025-05-07T20:32:01.3125236Z compiled=False, 2025-05-07T20:32:01.3125305Z ) 2025-05-07T20:32:01.3125555Z self = 2025-05-07T20:32:01.3125744Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3125748Z 2025-05-07T20:32:01.3125824Z @given( 2025-05-07T20:32:01.3125942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3126039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3126147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3126263Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3126374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3126449Z ) 2025-05-07T20:32:01.3126689Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3126783Z def test_silu_mul_quant( 2025-05-07T20:32:01.3126859Z self, 2025-05-07T20:32:01.3126932Z T: int, 2025-05-07T20:32:01.3127005Z D: int, 2025-05-07T20:32:01.3127101Z scale_ub: Optional[float], 2025-05-07T20:32:01.3127275Z contiguous: bool, 2025-05-07T20:32:01.3127359Z compiled: bool, 2025-05-07T20:32:01.3127438Z ) -> None: 2025-05-07T20:32:01.3127529Z torch.manual_seed(2025) 2025-05-07T20:32:01.3127598Z 2025-05-07T20:32:01.3127763Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3127832Z 2025-05-07T20:32:01.3127923Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3128043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3128129Z x = x_sign * x_clamp 2025-05-07T20:32:01.3128209Z x0 = x[:, :D] 2025-05-07T20:32:01.3128285Z x1 = x[:, D:] 2025-05-07T20:32:01.3128354Z 2025-05-07T20:32:01.3128438Z if contiguous: 2025-05-07T20:32:01.3128602Z x0 = x0.contiguous() 2025-05-07T20:32:01.3128690Z x1 = x1.contiguous() 2025-05-07T20:32:01.3128761Z 2025-05-07T20:32:01.3128849Z if scale_ub is not None: 2025-05-07T20:32:01.3128949Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3129088Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3129160Z ) 2025-05-07T20:32:01.3129233Z else: 2025-05-07T20:32:01.3129329Z scale_ub_tensor = None 2025-05-07T20:32:01.3129399Z 2025-05-07T20:32:01.3129525Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3129612Z op = silu_mul_quant 2025-05-07T20:32:01.3129693Z if compiled: 2025-05-07T20:32:01.3129793Z op = torch.compile(op) 2025-05-07T20:32:01.3129894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3129964Z 2025-05-07T20:32:01.3130055Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3130060Z 2025-05-07T20:32:01.3130157Z moe/activation_test.py:117: 2025-05-07T20:32:01.3130283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3130384Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3130478Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3130974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.3131068Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3131420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3131643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3131975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3132064Z kernel = self.compile( 2025-05-07T20:32:01.3132445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3132613Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3132739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3132747Z 2025-05-07T20:32:01.3132943Z self = 2025-05-07T20:32:01.3133776Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3134270Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397d95940>} 2025-05-07T20:32:01.3135002Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3135188Z context = 2025-05-07T20:32:01.3135192Z 2025-05-07T20:32:01.3135439Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3135697Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3135803Z module_map=module_map) 2025-05-07T20:32:01.3135958Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3136057Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3136132Z E ^ 2025-05-07T20:32:01.3136476Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3136480Z 2025-05-07T20:32:01.3136983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3136988Z 2025-05-07T20:32:01.3137090Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3137310Z self=, 2025-05-07T20:32:01.3137390Z T=1, 2025-05-07T20:32:01.3137464Z D=7168, 2025-05-07T20:32:01.3137548Z scale_ub=None, 2025-05-07T20:32:01.3137630Z contiguous=True, 2025-05-07T20:32:01.3137709Z compiled=False, 2025-05-07T20:32:01.3137781Z ) 2025-05-07T20:32:01.3137994Z self = 2025-05-07T20:32:01.3138150Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3138159Z 2025-05-07T20:32:01.3138232Z @given( 2025-05-07T20:32:01.3138349Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3138451Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3138562Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3138681Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3138793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3138865Z ) 2025-05-07T20:32:01.3139104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3139201Z def test_silu_mul_quant( 2025-05-07T20:32:01.3139274Z self, 2025-05-07T20:32:01.3139350Z T: int, 2025-05-07T20:32:01.3139428Z D: int, 2025-05-07T20:32:01.3139524Z scale_ub: Optional[float], 2025-05-07T20:32:01.3139612Z contiguous: bool, 2025-05-07T20:32:01.3139693Z compiled: bool, 2025-05-07T20:32:01.3139767Z ) -> None: 2025-05-07T20:32:01.3139860Z torch.manual_seed(2025) 2025-05-07T20:32:01.3139931Z 2025-05-07T20:32:01.3140093Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3140169Z 2025-05-07T20:32:01.3140258Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3140385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3140473Z x = x_sign * x_clamp 2025-05-07T20:32:01.3140551Z x0 = x[:, :D] 2025-05-07T20:32:01.3140627Z x1 = x[:, D:] 2025-05-07T20:32:01.3140706Z 2025-05-07T20:32:01.3140785Z if contiguous: 2025-05-07T20:32:01.3140878Z x0 = x0.contiguous() 2025-05-07T20:32:01.3140963Z x1 = x1.contiguous() 2025-05-07T20:32:01.3141032Z 2025-05-07T20:32:01.3141122Z if scale_ub is not None: 2025-05-07T20:32:01.3141224Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3141354Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3141429Z ) 2025-05-07T20:32:01.3141502Z else: 2025-05-07T20:32:01.3141593Z scale_ub_tensor = None 2025-05-07T20:32:01.3141667Z 2025-05-07T20:32:01.3141792Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3141878Z op = silu_mul_quant 2025-05-07T20:32:01.3141965Z if compiled: 2025-05-07T20:32:01.3142061Z op = torch.compile(op) 2025-05-07T20:32:01.3142165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3142236Z 2025-05-07T20:32:01.3142411Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3142415Z 2025-05-07T20:32:01.3142512Z moe/activation_test.py:117: 2025-05-07T20:32:01.3142637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3142735Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3142832Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3143319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3143412Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3143769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3144061Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3144399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3144494Z kernel = self.compile( 2025-05-07T20:32:01.3144869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3145039Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3145164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3145168Z 2025-05-07T20:32:01.3145369Z self = 2025-05-07T20:32:01.3146131Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3146628Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397d96ca0>} 2025-05-07T20:32:01.3147358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3147548Z context = 2025-05-07T20:32:01.3147552Z 2025-05-07T20:32:01.3147716Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3149266Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3149371Z module_map=module_map) 2025-05-07T20:32:01.3149533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3149633Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3149712Z E ^ 2025-05-07T20:32:01.3150054Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3150059Z 2025-05-07T20:32:01.3150465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3150470Z 2025-05-07T20:32:01.3150573Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3150788Z self=, 2025-05-07T20:32:01.3150866Z T=16384, 2025-05-07T20:32:01.3150941Z D=7168, 2025-05-07T20:32:01.3151022Z scale_ub=1200.0, 2025-05-07T20:32:01.3151111Z contiguous=False, 2025-05-07T20:32:01.3151193Z compiled=True, 2025-05-07T20:32:01.3151264Z ) 2025-05-07T20:32:01.3151480Z self = 2025-05-07T20:32:01.3151656Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.3151661Z 2025-05-07T20:32:01.3151735Z @given( 2025-05-07T20:32:01.3151857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3151953Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3152153Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3152267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3152375Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3152449Z ) 2025-05-07T20:32:01.3152689Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3152777Z def test_silu_mul_quant( 2025-05-07T20:32:01.3152853Z self, 2025-05-07T20:32:01.3152928Z T: int, 2025-05-07T20:32:01.3153001Z D: int, 2025-05-07T20:32:01.3153099Z scale_ub: Optional[float], 2025-05-07T20:32:01.3153184Z contiguous: bool, 2025-05-07T20:32:01.3153267Z compiled: bool, 2025-05-07T20:32:01.3153420Z ) -> None: 2025-05-07T20:32:01.3153512Z torch.manual_seed(2025) 2025-05-07T20:32:01.3153582Z 2025-05-07T20:32:01.3153749Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3153828Z 2025-05-07T20:32:01.3153921Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3154041Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3154125Z x = x_sign * x_clamp 2025-05-07T20:32:01.3154206Z x0 = x[:, :D] 2025-05-07T20:32:01.3154283Z x1 = x[:, D:] 2025-05-07T20:32:01.3154354Z 2025-05-07T20:32:01.3154435Z if contiguous: 2025-05-07T20:32:01.3154521Z x0 = x0.contiguous() 2025-05-07T20:32:01.3154606Z x1 = x1.contiguous() 2025-05-07T20:32:01.3154678Z 2025-05-07T20:32:01.3154765Z if scale_ub is not None: 2025-05-07T20:32:01.3154865Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3155005Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3155078Z ) 2025-05-07T20:32:01.3155156Z else: 2025-05-07T20:32:01.3155249Z scale_ub_tensor = None 2025-05-07T20:32:01.3155319Z 2025-05-07T20:32:01.3155465Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3155565Z op = silu_mul_quant 2025-05-07T20:32:01.3155665Z if compiled: 2025-05-07T20:32:01.3155770Z op = torch.compile(op) 2025-05-07T20:32:01.3155872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3155942Z 2025-05-07T20:32:01.3156035Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3156039Z 2025-05-07T20:32:01.3156133Z moe/activation_test.py:117: 2025-05-07T20:32:01.3156262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3156359Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3156454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3156820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3156909Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3157391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3157492Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3157842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3158060Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3158391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3158484Z kernel = self.compile( 2025-05-07T20:32:01.3158862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3159035Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3159159Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3159167Z 2025-05-07T20:32:01.3159364Z self = 2025-05-07T20:32:01.3160211Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3160705Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397d97f60>} 2025-05-07T20:32:01.3161432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3161692Z context = 2025-05-07T20:32:01.3161697Z 2025-05-07T20:32:01.3161861Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3162114Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3162226Z module_map=module_map) 2025-05-07T20:32:01.3162382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3162477Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3162555Z E ^ 2025-05-07T20:32:01.3162900Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3162905Z 2025-05-07T20:32:01.3163308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3163312Z 2025-05-07T20:32:01.3163411Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3163630Z self=, 2025-05-07T20:32:01.3163709Z T=1, 2025-05-07T20:32:01.3163782Z D=7168, 2025-05-07T20:32:01.3163867Z scale_ub=None, 2025-05-07T20:32:01.3163957Z contiguous=False, 2025-05-07T20:32:01.3164038Z compiled=False, 2025-05-07T20:32:01.3164110Z ) 2025-05-07T20:32:01.3164321Z self = 2025-05-07T20:32:01.3164481Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3164485Z 2025-05-07T20:32:01.3164560Z @given( 2025-05-07T20:32:01.3164678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3164773Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3164887Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3165000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3165117Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3165188Z ) 2025-05-07T20:32:01.3165450Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3165558Z def test_silu_mul_quant( 2025-05-07T20:32:01.3165645Z self, 2025-05-07T20:32:01.3165719Z T: int, 2025-05-07T20:32:01.3165795Z D: int, 2025-05-07T20:32:01.3165890Z scale_ub: Optional[float], 2025-05-07T20:32:01.3165977Z contiguous: bool, 2025-05-07T20:32:01.3166065Z compiled: bool, 2025-05-07T20:32:01.3166140Z ) -> None: 2025-05-07T20:32:01.3166232Z torch.manual_seed(2025) 2025-05-07T20:32:01.3166306Z 2025-05-07T20:32:01.3166468Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3166539Z 2025-05-07T20:32:01.3166628Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3166748Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3166840Z x = x_sign * x_clamp 2025-05-07T20:32:01.3170270Z x0 = x[:, :D] 2025-05-07T20:32:01.3170363Z x1 = x[:, D:] 2025-05-07T20:32:01.3170436Z 2025-05-07T20:32:01.3170518Z if contiguous: 2025-05-07T20:32:01.3170607Z x0 = x0.contiguous() 2025-05-07T20:32:01.3170804Z x1 = x1.contiguous() 2025-05-07T20:32:01.3170876Z 2025-05-07T20:32:01.3170965Z if scale_ub is not None: 2025-05-07T20:32:01.3171071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3171204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3171278Z ) 2025-05-07T20:32:01.3171356Z else: 2025-05-07T20:32:01.3171447Z scale_ub_tensor = None 2025-05-07T20:32:01.3171517Z 2025-05-07T20:32:01.3171648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3171736Z op = silu_mul_quant 2025-05-07T20:32:01.3171821Z if compiled: 2025-05-07T20:32:01.3171918Z op = torch.compile(op) 2025-05-07T20:32:01.3172119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3172196Z 2025-05-07T20:32:01.3172285Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3172290Z 2025-05-07T20:32:01.3172383Z moe/activation_test.py:117: 2025-05-07T20:32:01.3172519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3172616Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3172713Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3173209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3173303Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3173775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3173994Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3174332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3174424Z kernel = self.compile( 2025-05-07T20:32:01.3174803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3174980Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3175104Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3175108Z 2025-05-07T20:32:01.3175307Z self = 2025-05-07T20:32:01.3176070Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3176567Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397a149a0>} 2025-05-07T20:32:01.3177301Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3177493Z context = 2025-05-07T20:32:01.3177498Z 2025-05-07T20:32:01.3177658Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3177917Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3178022Z module_map=module_map) 2025-05-07T20:32:01.3178183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3178279Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3178355Z E ^ 2025-05-07T20:32:01.3178709Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3178714Z 2025-05-07T20:32:01.3179115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3179207Z 2025-05-07T20:32:01.3179311Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3179527Z self=, 2025-05-07T20:32:01.3179603Z T=2048, 2025-05-07T20:32:01.3179681Z D=7168, 2025-05-07T20:32:01.3179763Z scale_ub=None, 2025-05-07T20:32:01.3179849Z contiguous=False, 2025-05-07T20:32:01.3179933Z compiled=True, 2025-05-07T20:32:01.3180005Z ) 2025-05-07T20:32:01.3180217Z self = 2025-05-07T20:32:01.3180387Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.3180391Z 2025-05-07T20:32:01.3180466Z @given( 2025-05-07T20:32:01.3180661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3180765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3180877Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3180995Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3181112Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3181183Z ) 2025-05-07T20:32:01.3181423Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3181513Z def test_silu_mul_quant( 2025-05-07T20:32:01.3181587Z self, 2025-05-07T20:32:01.3181667Z T: int, 2025-05-07T20:32:01.3181741Z D: int, 2025-05-07T20:32:01.3181839Z scale_ub: Optional[float], 2025-05-07T20:32:01.3181929Z contiguous: bool, 2025-05-07T20:32:01.3182013Z compiled: bool, 2025-05-07T20:32:01.3182095Z ) -> None: 2025-05-07T20:32:01.3182186Z torch.manual_seed(2025) 2025-05-07T20:32:01.3182255Z 2025-05-07T20:32:01.3182424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3182494Z 2025-05-07T20:32:01.3182581Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3182708Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3182801Z x = x_sign * x_clamp 2025-05-07T20:32:01.3182878Z x0 = x[:, :D] 2025-05-07T20:32:01.3182958Z x1 = x[:, D:] 2025-05-07T20:32:01.3183033Z 2025-05-07T20:32:01.3183113Z if contiguous: 2025-05-07T20:32:01.3183205Z x0 = x0.contiguous() 2025-05-07T20:32:01.3183293Z x1 = x1.contiguous() 2025-05-07T20:32:01.3183369Z 2025-05-07T20:32:01.3183455Z if scale_ub is not None: 2025-05-07T20:32:01.3183560Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3183693Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3183767Z ) 2025-05-07T20:32:01.3183847Z else: 2025-05-07T20:32:01.3183941Z scale_ub_tensor = None 2025-05-07T20:32:01.3184015Z 2025-05-07T20:32:01.3184142Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3184229Z op = silu_mul_quant 2025-05-07T20:32:01.3184314Z if compiled: 2025-05-07T20:32:01.3184415Z op = torch.compile(op) 2025-05-07T20:32:01.3184516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3184588Z 2025-05-07T20:32:01.3184676Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3184680Z 2025-05-07T20:32:01.3184774Z moe/activation_test.py:117: 2025-05-07T20:32:01.3184904Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3185002Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3185102Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3185463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3185552Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3186045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3186139Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3186576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3186795Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3187127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3187222Z kernel = self.compile( 2025-05-07T20:32:01.3187596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3187765Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3187891Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3187972Z 2025-05-07T20:32:01.3188173Z self = 2025-05-07T20:32:01.3188938Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3189436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397a16160>} 2025-05-07T20:32:01.3190163Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3190352Z context = 2025-05-07T20:32:01.3190356Z 2025-05-07T20:32:01.3190521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3190776Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3190882Z module_map=module_map) 2025-05-07T20:32:01.3191046Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3191144Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3191220Z E ^ 2025-05-07T20:32:01.3191564Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3191571Z 2025-05-07T20:32:01.3191975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3191979Z 2025-05-07T20:32:01.3192080Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3192298Z self=, 2025-05-07T20:32:01.3192379Z T=4096, 2025-05-07T20:32:01.3192456Z D=7168, 2025-05-07T20:32:01.3192543Z scale_ub=None, 2025-05-07T20:32:01.3192627Z contiguous=False, 2025-05-07T20:32:01.3192708Z compiled=True, 2025-05-07T20:32:01.3192782Z ) 2025-05-07T20:32:01.3192999Z self = 2025-05-07T20:32:01.3193168Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.3193172Z 2025-05-07T20:32:01.3193247Z @given( 2025-05-07T20:32:01.3193364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3193464Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3193576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3193690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3193805Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3193879Z ) 2025-05-07T20:32:01.3194124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3194217Z def test_silu_mul_quant( 2025-05-07T20:32:01.3194294Z self, 2025-05-07T20:32:01.3194371Z T: int, 2025-05-07T20:32:01.3194447Z D: int, 2025-05-07T20:32:01.3194627Z scale_ub: Optional[float], 2025-05-07T20:32:01.3194719Z contiguous: bool, 2025-05-07T20:32:01.3194803Z compiled: bool, 2025-05-07T20:32:01.3194881Z ) -> None: 2025-05-07T20:32:01.3194978Z torch.manual_seed(2025) 2025-05-07T20:32:01.3195047Z 2025-05-07T20:32:01.3195209Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3195281Z 2025-05-07T20:32:01.3195370Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3195491Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3195579Z x = x_sign * x_clamp 2025-05-07T20:32:01.3195655Z x0 = x[:, :D] 2025-05-07T20:32:01.3195735Z x1 = x[:, D:] 2025-05-07T20:32:01.3195806Z 2025-05-07T20:32:01.3195965Z if contiguous: 2025-05-07T20:32:01.3196057Z x0 = x0.contiguous() 2025-05-07T20:32:01.3196142Z x1 = x1.contiguous() 2025-05-07T20:32:01.3196210Z 2025-05-07T20:32:01.3196300Z if scale_ub is not None: 2025-05-07T20:32:01.3196409Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3196540Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3196614Z ) 2025-05-07T20:32:01.3196688Z else: 2025-05-07T20:32:01.3196780Z scale_ub_tensor = None 2025-05-07T20:32:01.3196853Z 2025-05-07T20:32:01.3196978Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3197069Z op = silu_mul_quant 2025-05-07T20:32:01.3197151Z if compiled: 2025-05-07T20:32:01.3197247Z op = torch.compile(op) 2025-05-07T20:32:01.3197351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3197420Z 2025-05-07T20:32:01.3197513Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3197517Z 2025-05-07T20:32:01.3197613Z moe/activation_test.py:117: 2025-05-07T20:32:01.3197739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3197845Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3197942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3198518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3198654Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3199156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3199254Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3199614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3199841Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3200176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3200275Z kernel = self.compile( 2025-05-07T20:32:01.3200653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3200834Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3200963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3200967Z 2025-05-07T20:32:01.3201170Z self = 2025-05-07T20:32:01.3201935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3202436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397a16e80>} 2025-05-07T20:32:01.3203179Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3203556Z context = 2025-05-07T20:32:01.3203561Z 2025-05-07T20:32:01.3203727Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3203985Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3204093Z module_map=module_map) 2025-05-07T20:32:01.3204258Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3204357Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3204437Z E ^ 2025-05-07T20:32:01.3204898Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3204903Z 2025-05-07T20:32:01.3205313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3205324Z 2025-05-07T20:32:01.3205432Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3205651Z self=, 2025-05-07T20:32:01.3205728Z T=16384, 2025-05-07T20:32:01.3205809Z D=5120, 2025-05-07T20:32:01.3205892Z scale_ub=1200.0, 2025-05-07T20:32:01.3205977Z contiguous=False, 2025-05-07T20:32:01.3206064Z compiled=False, 2025-05-07T20:32:01.3206134Z ) 2025-05-07T20:32:01.3206346Z self = 2025-05-07T20:32:01.3206522Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3206526Z 2025-05-07T20:32:01.3206605Z @given( 2025-05-07T20:32:01.3206728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3206826Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3206936Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3207059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3207169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3207241Z ) 2025-05-07T20:32:01.3207485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3207575Z def test_silu_mul_quant( 2025-05-07T20:32:01.3207652Z self, 2025-05-07T20:32:01.3207728Z T: int, 2025-05-07T20:32:01.3207802Z D: int, 2025-05-07T20:32:01.3207901Z scale_ub: Optional[float], 2025-05-07T20:32:01.3207987Z contiguous: bool, 2025-05-07T20:32:01.3208070Z compiled: bool, 2025-05-07T20:32:01.3208153Z ) -> None: 2025-05-07T20:32:01.3208249Z torch.manual_seed(2025) 2025-05-07T20:32:01.3208320Z 2025-05-07T20:32:01.3208485Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3208556Z 2025-05-07T20:32:01.3208646Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3208784Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3208870Z x = x_sign * x_clamp 2025-05-07T20:32:01.3208948Z x0 = x[:, :D] 2025-05-07T20:32:01.3209032Z x1 = x[:, D:] 2025-05-07T20:32:01.3209102Z 2025-05-07T20:32:01.3209188Z if contiguous: 2025-05-07T20:32:01.3209276Z x0 = x0.contiguous() 2025-05-07T20:32:01.3209364Z x1 = x1.contiguous() 2025-05-07T20:32:01.3209437Z 2025-05-07T20:32:01.3209524Z if scale_ub is not None: 2025-05-07T20:32:01.3209626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3209759Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3209832Z ) 2025-05-07T20:32:01.3209914Z else: 2025-05-07T20:32:01.3210009Z scale_ub_tensor = None 2025-05-07T20:32:01.3210078Z 2025-05-07T20:32:01.3210201Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3210380Z op = silu_mul_quant 2025-05-07T20:32:01.3210462Z if compiled: 2025-05-07T20:32:01.3210565Z op = torch.compile(op) 2025-05-07T20:32:01.3210666Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3210735Z 2025-05-07T20:32:01.3210825Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3210829Z 2025-05-07T20:32:01.3210923Z moe/activation_test.py:117: 2025-05-07T20:32:01.3211049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3211149Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3211245Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3211808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.3211909Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3212260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3212485Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3212818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3212908Z kernel = self.compile( 2025-05-07T20:32:01.3213288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3213456Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3213585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3213590Z 2025-05-07T20:32:01.3213876Z self = 2025-05-07T20:32:01.3214636Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3215140Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a43cc220>} 2025-05-07T20:32:01.3215868Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3216064Z context = 2025-05-07T20:32:01.3216069Z 2025-05-07T20:32:01.3216230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3216487Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3216596Z module_map=module_map) 2025-05-07T20:32:01.3216754Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3216858Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3216931Z E ^ 2025-05-07T20:32:01.3217276Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3217280Z 2025-05-07T20:32:01.3217688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3217692Z 2025-05-07T20:32:01.3217794Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3218011Z self=, 2025-05-07T20:32:01.3218085Z T=16384, 2025-05-07T20:32:01.3218160Z D=5120, 2025-05-07T20:32:01.3218243Z scale_ub=1200.0, 2025-05-07T20:32:01.3218333Z contiguous=True, 2025-05-07T20:32:01.3218412Z compiled=True, 2025-05-07T20:32:01.3218486Z ) 2025-05-07T20:32:01.3218699Z self = 2025-05-07T20:32:01.3219048Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.3219052Z 2025-05-07T20:32:01.3219129Z @given( 2025-05-07T20:32:01.3219243Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3219343Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3219456Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3219569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3219683Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3219754Z ) 2025-05-07T20:32:01.3219995Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3220088Z def test_silu_mul_quant( 2025-05-07T20:32:01.3220160Z self, 2025-05-07T20:32:01.3220880Z T: int, 2025-05-07T20:32:01.3220962Z D: int, 2025-05-07T20:32:01.3221057Z scale_ub: Optional[float], 2025-05-07T20:32:01.3221144Z contiguous: bool, 2025-05-07T20:32:01.3221236Z compiled: bool, 2025-05-07T20:32:01.3221311Z ) -> None: 2025-05-07T20:32:01.3221405Z torch.manual_seed(2025) 2025-05-07T20:32:01.3221475Z 2025-05-07T20:32:01.3221639Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3221710Z 2025-05-07T20:32:01.3221798Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3221921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3222010Z x = x_sign * x_clamp 2025-05-07T20:32:01.3222087Z x0 = x[:, :D] 2025-05-07T20:32:01.3222163Z x1 = x[:, D:] 2025-05-07T20:32:01.3222235Z 2025-05-07T20:32:01.3222316Z if contiguous: 2025-05-07T20:32:01.3222404Z x0 = x0.contiguous() 2025-05-07T20:32:01.3222501Z x1 = x1.contiguous() 2025-05-07T20:32:01.3222570Z 2025-05-07T20:32:01.3222657Z if scale_ub is not None: 2025-05-07T20:32:01.3222765Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3222895Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3222980Z ) 2025-05-07T20:32:01.3223053Z else: 2025-05-07T20:32:01.3223143Z scale_ub_tensor = None 2025-05-07T20:32:01.3223218Z 2025-05-07T20:32:01.3223342Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3223431Z op = silu_mul_quant 2025-05-07T20:32:01.3223515Z if compiled: 2025-05-07T20:32:01.3223613Z op = torch.compile(op) 2025-05-07T20:32:01.3223714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3223789Z 2025-05-07T20:32:01.3223877Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3223881Z 2025-05-07T20:32:01.3223980Z moe/activation_test.py:117: 2025-05-07T20:32:01.3224110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3224206Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3224305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3224670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3224761Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3225248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3225341Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3225694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3225912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3226250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3226346Z kernel = self.compile( 2025-05-07T20:32:01.3226723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3226977Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3227104Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3227109Z 2025-05-07T20:32:01.3227310Z self = 2025-05-07T20:32:01.3228073Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3228639Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a43cd4e0>} 2025-05-07T20:32:01.3229374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3229565Z context = 2025-05-07T20:32:01.3229570Z 2025-05-07T20:32:01.3229732Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3229991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3230096Z module_map=module_map) 2025-05-07T20:32:01.3230253Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3230353Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3230428Z E ^ 2025-05-07T20:32:01.3230778Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3230783Z 2025-05-07T20:32:01.3231188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
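Every CompilationError in this run has the same root cause: the Triton kernel behind silu_mul_quant requests the fp8e4nv dtype (PyTorch's torch.float8_e4m3fn), which Triton only lowers on NVIDIA GPUs of compute capability 8.9 or newer; on older parts only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports, and the GPU in this job evidently predates that. A minimal sketch of a guard one could put in front of the FP8 path; fp8e4nv_supported and the (8, 9) threshold are illustrative assumptions, not FBGEMM API:

import torch

def fp8e4nv_supported() -> bool:
    """True when Triton can lower fp8e4nv (torch.float8_e4m3fn) kernels."""
    if not torch.cuda.is_available():
        return False
    # Ada (8.9) is the first NVIDIA compute capability with hardware e4m3
    # support; earlier GPUs only get Triton's fp8e4b15 / fp8e5 encodings.
    return torch.cuda.get_device_capability() >= (8, 9)

Since the test methods take self, the suite appears to be unittest-based, so the same check could back a skip decorator, e.g. @unittest.skipUnless(fp8e4nv_supported(), "fp8e4nv needs sm_89+"), turning these examples into skips rather than failures on runners like this one.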
Hypothesis went on to try eleven more parameter combinations. Each failed at the same point (the _fbgemm_silu_mul_quant[grid] launch at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, compiled through triton/runtime/jit.py:623 and triton/compiler/compiler.py:273, src.make_ir -> ast_to_ttir) with the identical CompilationError, so only the example headers are kept here; for the compiled=False examples the torch/_dynamo/eval_frame.py frame is simply absent from the traceback:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)

Each attempt ended with:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3374581Z 2025-05-07T20:32:01.3374985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3374990Z 2025-05-07T20:32:01.3375093Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3375307Z self=, 2025-05-07T20:32:01.3375381Z T=16384, 2025-05-07T20:32:01.3375458Z D=5120, 2025-05-07T20:32:01.3375536Z scale_ub=None, 2025-05-07T20:32:01.3375625Z contiguous=False, 2025-05-07T20:32:01.3375710Z compiled=False, 2025-05-07T20:32:01.3375781Z ) 2025-05-07T20:32:01.3375993Z self = 2025-05-07T20:32:01.3376170Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3376175Z 2025-05-07T20:32:01.3376248Z @given( 2025-05-07T20:32:01.3376366Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3376461Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3376571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3376686Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3376794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3376865Z ) 2025-05-07T20:32:01.3377106Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3377195Z def test_silu_mul_quant( 2025-05-07T20:32:01.3377272Z self, 2025-05-07T20:32:01.3377350Z T: int, 2025-05-07T20:32:01.3377424Z D: int, 2025-05-07T20:32:01.3377521Z scale_ub: Optional[float], 2025-05-07T20:32:01.3377606Z contiguous: bool, 2025-05-07T20:32:01.3377692Z compiled: bool, 2025-05-07T20:32:01.3377770Z ) -> None: 2025-05-07T20:32:01.3377862Z torch.manual_seed(2025) 2025-05-07T20:32:01.3377932Z 2025-05-07T20:32:01.3378096Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3378166Z 2025-05-07T20:32:01.3378253Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3378376Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3380134Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
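Two failure modes alternate in this run. The CompilationError above is raised while Triton lowers _fbgemm_silu_mul_quant: the kernel requests the fp8e4nv element type (PyTorch's float8_e4m3fn), but this GPU only exposes fp8e4b15 and fp8e5, which indicates a compute capability below sm_89 (Ada/Hopper). Below is a minimal sketch of a capability guard a test could use to skip FP8 E4M3 cases on such hardware; the helper name is ours, not FBGEMM's:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv maps to torch.float8_e4m3fn; native support starts at
        # sm_89 (Ada) / sm_90 (Hopper). Older GPUs only offer the
        # fp8e4b15 / fp8e5 encodings listed in the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # usage sketch:
    # @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 needs sm_89+")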
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3380222Z 2025-05-07T20:32:01.3380340Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.3380344Z 2025-05-07T20:32:01.3380443Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3380665Z self=, 2025-05-07T20:32:01.3380740Z T=4096, 2025-05-07T20:32:01.3380812Z D=7168, 2025-05-07T20:32:01.3380897Z scale_ub=1200.0, 2025-05-07T20:32:01.3380978Z contiguous=True, 2025-05-07T20:32:01.3381057Z compiled=True, 2025-05-07T20:32:01.3381129Z ) 2025-05-07T20:32:01.3381340Z self = 2025-05-07T20:32:01.3381613Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.3381618Z 2025-05-07T20:32:01.3381697Z @given( 2025-05-07T20:32:01.3381814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3381912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3382028Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3382139Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3382251Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3382323Z ) 2025-05-07T20:32:01.3382561Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3382656Z def test_silu_mul_quant( 2025-05-07T20:32:01.3382729Z self, 2025-05-07T20:32:01.3382802Z T: int, 2025-05-07T20:32:01.3382878Z D: int, 2025-05-07T20:32:01.3382972Z scale_ub: Optional[float], 2025-05-07T20:32:01.3383057Z contiguous: bool, 2025-05-07T20:32:01.3383143Z compiled: bool, 2025-05-07T20:32:01.3383224Z ) -> None: 2025-05-07T20:32:01.3383321Z torch.manual_seed(2025) 2025-05-07T20:32:01.3383390Z 2025-05-07T20:32:01.3383551Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3383632Z 2025-05-07T20:32:01.3383719Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3383839Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3385624Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
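For orientation, here is a plausible eager-mode reference for what the op under test computes, inferred only from the test body and the op's name (SiLU gate, elementwise multiply, FP8 quantization with an optional scale upper bound); FBGEMM's actual scaling rules may differ:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # y = SiLU(x0) * x1 in fp32, then per-tensor quantization to E4M3.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        amax = y.abs().amax()
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub.float())
        scale = (amax / FP8_MAX).clamp(min=1e-12)
        y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale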
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3385634Z 2025-05-07T20:32:01.3385747Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.3385752Z 2025-05-07T20:32:01.3385856Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3386074Z self=, 2025-05-07T20:32:01.3386154Z T=16384, 2025-05-07T20:32:01.3386230Z D=7168, 2025-05-07T20:32:01.3386308Z scale_ub=None, 2025-05-07T20:32:01.3386391Z contiguous=False, 2025-05-07T20:32:01.3386471Z compiled=False, 2025-05-07T20:32:01.3386541Z ) 2025-05-07T20:32:01.3386753Z self = 2025-05-07T20:32:01.3386923Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3386927Z 2025-05-07T20:32:01.3386999Z @given( 2025-05-07T20:32:01.3387116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3387210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3387324Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3387439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3387549Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3387709Z ) 2025-05-07T20:32:01.3387948Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3388037Z def test_silu_mul_quant( 2025-05-07T20:32:01.3388114Z self, 2025-05-07T20:32:01.3388188Z T: int, 2025-05-07T20:32:01.3388261Z D: int, 2025-05-07T20:32:01.3388362Z scale_ub: Optional[float], 2025-05-07T20:32:01.3388447Z contiguous: bool, 2025-05-07T20:32:01.3388531Z compiled: bool, 2025-05-07T20:32:01.3388608Z ) -> None: 2025-05-07T20:32:01.3388698Z torch.manual_seed(2025) 2025-05-07T20:32:01.3388768Z 2025-05-07T20:32:01.3388931Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3390739Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
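The 448.00 MiB request is exactly the input tensor itself: x has shape [T, 2*D] in bfloat16 (2 bytes per element), so for T=16384 and D=7168:

    T, D = 16384, 7168
    size_bytes = T * (2 * D) * 2      # bfloat16 = 2 bytes per element
    print(size_bytes / 2**20)         # -> 448.0 MiB, matching the log

The same arithmetic reproduces every allocation size in this section (320 MiB for T=16384/D=5120, 112 MiB for T=4096/D=7168, and so on).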
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3390754Z 2025-05-07T20:32:01.3390869Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3390873Z 2025-05-07T20:32:01.3390972Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3391191Z self=, 2025-05-07T20:32:01.3391264Z T=2048, 2025-05-07T20:32:01.3391336Z D=7168, 2025-05-07T20:32:01.3391417Z scale_ub=1200.0, 2025-05-07T20:32:01.3391501Z contiguous=True, 2025-05-07T20:32:01.3391580Z compiled=True, 2025-05-07T20:32:01.3391652Z ) 2025-05-07T20:32:01.3391861Z self = 2025-05-07T20:32:01.3392023Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.3392037Z 2025-05-07T20:32:01.3392110Z @given( 2025-05-07T20:32:01.3392223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3392321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3392431Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3392542Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3392654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3392725Z ) 2025-05-07T20:32:01.3392962Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3393056Z def test_silu_mul_quant( 2025-05-07T20:32:01.3393130Z self, 2025-05-07T20:32:01.3393209Z T: int, 2025-05-07T20:32:01.3393285Z D: int, 2025-05-07T20:32:01.3393378Z scale_ub: Optional[float], 2025-05-07T20:32:01.3393466Z contiguous: bool, 2025-05-07T20:32:01.3393548Z compiled: bool, 2025-05-07T20:32:01.3393626Z ) -> None: 2025-05-07T20:32:01.3393719Z torch.manual_seed(2025) 2025-05-07T20:32:01.3393790Z 2025-05-07T20:32:01.3393949Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3394023Z 2025-05-07T20:32:01.3394111Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3394230Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3395955Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
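Note that a single example is nowhere near the 22.07 GiB capacity. Assuming the only live tensors are x, x_sign, x_clamp, and the x0/x1 copies, the peak for this example (T=2048, D=7168) is a few hundred MiB:

    T, D = 2048, 7168
    full = T * (2 * D) * 2 / 2**20    # one [T, 2*D] bf16 tensor = 56 MiB
    half = full / 2                   # x0 / x1 are [T, D] each
    print(3 * full + 2 * half)        # x, x_sign, x_clamp, x0, x1 ~= 224 MiB

So the ~21.7 GiB that PyTorch reports as already allocated must be carried over from earlier Hypothesis examples, not required by the current one.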
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3396044Z 2025-05-07T20:32:01.3396158Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.3396162Z 2025-05-07T20:32:01.3396263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3396481Z self=, 2025-05-07T20:32:01.3396558Z T=2048, 2025-05-07T20:32:01.3396632Z D=7168, 2025-05-07T20:32:01.3396710Z scale_ub=None, 2025-05-07T20:32:01.3396795Z contiguous=True, 2025-05-07T20:32:01.3396875Z compiled=False, 2025-05-07T20:32:01.3396944Z ) 2025-05-07T20:32:01.3397156Z self = 2025-05-07T20:32:01.3397397Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3397402Z 2025-05-07T20:32:01.3397475Z @given( 2025-05-07T20:32:01.3397592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3397687Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3397805Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3397917Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3398026Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3398101Z ) 2025-05-07T20:32:01.3398777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3398895Z def test_silu_mul_quant( 2025-05-07T20:32:01.3398973Z self, 2025-05-07T20:32:01.3399047Z T: int, 2025-05-07T20:32:01.3399120Z D: int, 2025-05-07T20:32:01.3399219Z scale_ub: Optional[float], 2025-05-07T20:32:01.3399303Z contiguous: bool, 2025-05-07T20:32:01.3399385Z compiled: bool, 2025-05-07T20:32:01.3399464Z ) -> None: 2025-05-07T20:32:01.3399563Z torch.manual_seed(2025) 2025-05-07T20:32:01.3399637Z 2025-05-07T20:32:01.3399798Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3399867Z 2025-05-07T20:32:01.3399963Z > x_sign = torch.sign(x) 2025-05-07T20:32:01.3401678Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
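Given that residue, one mitigation is to release cached GPU memory between examples. A sketch, assuming it would be called at the top of the test body (the helper name is ours):

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()                # drop Python references to dead tensors
        torch.cuda.empty_cache()    # return cached blocks to the driver
        torch.cuda.synchronize()    # ensure the frees have completed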
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False, )
[... test body and kernel-launch traceback identical to the first CompilationError above; fails at moe/activation_test.py:117 in fn() ...]
E triton.compiler.errors.CompilationError: at 1:0: E def _fbgemm_silu_mul_quant( E ^ E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False, )
[... test body and kernel-launch traceback identical to the first CompilationError above ...]
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False, )
[... test body and kernel-launch traceback identical to the first CompilationError above ...]
E triton.compiler.errors.CompilationError: at 1:0: E def _fbgemm_silu_mul_quant( E ^ E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3442232Z 2025-05-07T20:32:01.3442638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3442646Z 2025-05-07T20:32:01.3442749Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3442968Z self=, 2025-05-07T20:32:01.3443046Z T=2048, 2025-05-07T20:32:01.3443126Z D=7168, 2025-05-07T20:32:01.3443210Z scale_ub=1200.0, 2025-05-07T20:32:01.3443296Z contiguous=True, 2025-05-07T20:32:01.3443377Z compiled=False, 2025-05-07T20:32:01.3443451Z ) 2025-05-07T20:32:01.3443667Z self = 2025-05-07T20:32:01.3443840Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.3443844Z 2025-05-07T20:32:01.3443923Z @given( 2025-05-07T20:32:01.3444039Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3444137Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3444255Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3444374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3444489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3444561Z ) 2025-05-07T20:32:01.3444801Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3444903Z def test_silu_mul_quant( 2025-05-07T20:32:01.3444978Z self, 2025-05-07T20:32:01.3445052Z T: int, 2025-05-07T20:32:01.3445136Z D: int, 2025-05-07T20:32:01.3445234Z scale_ub: Optional[float], 2025-05-07T20:32:01.3445323Z contiguous: bool, 2025-05-07T20:32:01.3445410Z compiled: bool, 2025-05-07T20:32:01.3445486Z ) -> None: 2025-05-07T20:32:01.3445580Z torch.manual_seed(2025) 2025-05-07T20:32:01.3445656Z 2025-05-07T20:32:01.3445822Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3447571Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
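The message's own suggestion can be applied from the test environment; the setting is only read when the CUDA caching allocator initializes, so it must be in place before the first CUDA call:

    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    import torch  # import (and any CUDA work) only after setting the variable

Note this addresses fragmentation, not the cross-example accumulation seen above.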
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False, )
[... test body and kernel-launch traceback identical to the first CompilationError above; fails at moe/activation_test.py:117 in fn() ...]
E triton.compiler.errors.CompilationError: at 1:0: E def _fbgemm_silu_mul_quant( E ^ E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3459857Z 2025-05-07T20:32:01.3460262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3460266Z 2025-05-07T20:32:01.3460366Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3460585Z self=, 2025-05-07T20:32:01.3460661Z T=2048, 2025-05-07T20:32:01.3460735Z D=5120, 2025-05-07T20:32:01.3460816Z scale_ub=None, 2025-05-07T20:32:01.3460897Z contiguous=True, 2025-05-07T20:32:01.3460985Z compiled=False, 2025-05-07T20:32:01.3461059Z ) 2025-05-07T20:32:01.3461270Z self = 2025-05-07T20:32:01.3461435Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3461440Z 2025-05-07T20:32:01.3461517Z @given( 2025-05-07T20:32:01.3461632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3461727Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3461840Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3461951Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3462063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3462134Z ) 2025-05-07T20:32:01.3462377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3462469Z def test_silu_mul_quant( 2025-05-07T20:32:01.3462542Z self, 2025-05-07T20:32:01.3462699Z T: int, 2025-05-07T20:32:01.3462774Z D: int, 2025-05-07T20:32:01.3462869Z scale_ub: Optional[float], 2025-05-07T20:32:01.3462954Z contiguous: bool, 2025-05-07T20:32:01.3463039Z compiled: bool, 2025-05-07T20:32:01.3463113Z ) -> None: 2025-05-07T20:32:01.3463204Z torch.manual_seed(2025) 2025-05-07T20:32:01.3463276Z 2025-05-07T20:32:01.3463437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3463509Z 2025-05-07T20:32:01.3463597Z > x_sign = torch.sign(x) 2025-05-07T20:32:01.3465409Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
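Both compiled=True and compiled=False examples hit the same CompilationError, because silu_mul_quant launches the Triton kernel itself (activation.py:80) whether or not torch.compile wraps it. A minimal eager repro sketch; the import path is inferred from the traceback and should be treated as an assumption:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    # On a pre-sm_89 GPU this raises the same fp8e4nv CompilationError,
    # with no torch.compile frame in the stack at all:
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)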
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3465425Z 2025-05-07T20:32:01.3465539Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:01.3465543Z 2025-05-07T20:32:01.3465642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3465861Z self=, 2025-05-07T20:32:01.3465937Z T=16384, 2025-05-07T20:32:01.3466012Z D=5120, 2025-05-07T20:32:01.3466093Z scale_ub=None, 2025-05-07T20:32:01.3466175Z contiguous=True, 2025-05-07T20:32:01.3466260Z compiled=False, 2025-05-07T20:32:01.3466333Z ) 2025-05-07T20:32:01.3466549Z self = 2025-05-07T20:32:01.3466722Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3466726Z 2025-05-07T20:32:01.3466798Z @given( 2025-05-07T20:32:01.3466919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3467017Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3467128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3467241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3467353Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3467425Z ) 2025-05-07T20:32:01.3467662Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3467755Z def test_silu_mul_quant( 2025-05-07T20:32:01.3467830Z self, 2025-05-07T20:32:01.3467907Z T: int, 2025-05-07T20:32:01.3467980Z D: int, 2025-05-07T20:32:01.3468075Z scale_ub: Optional[float], 2025-05-07T20:32:01.3468169Z contiguous: bool, 2025-05-07T20:32:01.3468252Z compiled: bool, 2025-05-07T20:32:01.3468329Z ) -> None: 2025-05-07T20:32:01.3468422Z torch.manual_seed(2025) 2025-05-07T20:32:01.3468491Z 2025-05-07T20:32:01.3468655Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3470382Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3470388Z 2025-05-07T20:32:01.3470507Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3470512Z 2025-05-07T20:32:01.3470613Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3470828Z self=, 2025-05-07T20:32:01.3470986Z T=4096, 2025-05-07T20:32:01.3471061Z D=5120, 2025-05-07T20:32:01.3471139Z scale_ub=None, 2025-05-07T20:32:01.3471221Z contiguous=True, 2025-05-07T20:32:01.3471304Z compiled=False, 2025-05-07T20:32:01.3471374Z ) 2025-05-07T20:32:01.3471585Z self = 2025-05-07T20:32:01.3471753Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3471757Z 2025-05-07T20:32:01.3471830Z @given( 2025-05-07T20:32:01.3471942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3472039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3472150Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3472408Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3472522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3472592Z ) 2025-05-07T20:32:01.3472833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3472930Z def test_silu_mul_quant( 2025-05-07T20:32:01.3473002Z self, 2025-05-07T20:32:01.3473079Z T: int, 2025-05-07T20:32:01.3473151Z D: int, 2025-05-07T20:32:01.3473245Z scale_ub: Optional[float], 2025-05-07T20:32:01.3473334Z contiguous: bool, 2025-05-07T20:32:01.3473417Z compiled: bool, 2025-05-07T20:32:01.3473492Z ) -> None: 2025-05-07T20:32:01.3473586Z torch.manual_seed(2025) 2025-05-07T20:32:01.3473656Z 2025-05-07T20:32:01.3473817Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3475571Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3475581Z 2025-05-07T20:32:01.3475710Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3475715Z 2025-05-07T20:32:01.3475822Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3476036Z self=, 2025-05-07T20:32:01.3476111Z T=2048, 2025-05-07T20:32:01.3476185Z D=5120, 2025-05-07T20:32:01.3476263Z scale_ub=None, 2025-05-07T20:32:01.3476346Z contiguous=False, 2025-05-07T20:32:01.3476426Z compiled=False, 2025-05-07T20:32:01.3476499Z ) 2025-05-07T20:32:01.3476712Z self = 2025-05-07T20:32:01.3476876Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3476885Z 2025-05-07T20:32:01.3476959Z @given( 2025-05-07T20:32:01.3477077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3477173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3477287Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3477399Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3477508Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3477582Z ) 2025-05-07T20:32:01.3477819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3477908Z def test_silu_mul_quant( 2025-05-07T20:32:01.3477985Z self, 2025-05-07T20:32:01.3478058Z T: int, 2025-05-07T20:32:01.3478131Z D: int, 2025-05-07T20:32:01.3478234Z scale_ub: Optional[float], 2025-05-07T20:32:01.3478319Z contiguous: bool, 2025-05-07T20:32:01.3478402Z compiled: bool, 2025-05-07T20:32:01.3478481Z ) -> None: 2025-05-07T20:32:01.3478679Z torch.manual_seed(2025) 2025-05-07T20:32:01.3478751Z 2025-05-07T20:32:01.3478911Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3480634Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3480714Z 2025-05-07T20:32:01.3480828Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3480832Z 2025-05-07T20:32:01.3480933Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3481150Z self=, 2025-05-07T20:32:01.3481229Z T=4096, 2025-05-07T20:32:01.3481301Z D=7168, 2025-05-07T20:32:01.3481383Z scale_ub=None, 2025-05-07T20:32:01.3481464Z contiguous=True, 2025-05-07T20:32:01.3481543Z compiled=True, 2025-05-07T20:32:01.3481616Z ) 2025-05-07T20:32:01.3481826Z self = 2025-05-07T20:32:01.3481994Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.3481998Z 2025-05-07T20:32:01.3482071Z @given( 2025-05-07T20:32:01.3482187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3482284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3482398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3482510Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3482623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3482701Z ) 2025-05-07T20:32:01.3482942Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3483031Z def test_silu_mul_quant( 2025-05-07T20:32:01.3483105Z self, 2025-05-07T20:32:01.3483183Z T: int, 2025-05-07T20:32:01.3483256Z D: int, 2025-05-07T20:32:01.3483350Z scale_ub: Optional[float], 2025-05-07T20:32:01.3483438Z contiguous: bool, 2025-05-07T20:32:01.3483520Z compiled: bool, 2025-05-07T20:32:01.3483594Z ) -> None: 2025-05-07T20:32:01.3483687Z torch.manual_seed(2025) 2025-05-07T20:32:01.3483756Z 2025-05-07T20:32:01.3483917Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3485647Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3485657Z 2025-05-07T20:32:01.3485772Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3485776Z 2025-05-07T20:32:01.3485878Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3486093Z self=, 2025-05-07T20:32:01.3486169Z T=2048, 2025-05-07T20:32:01.3486246Z D=5120, 2025-05-07T20:32:01.3486326Z scale_ub=1200.0, 2025-05-07T20:32:01.3486412Z contiguous=False, 2025-05-07T20:32:01.3486494Z compiled=False, 2025-05-07T20:32:01.3486564Z ) 2025-05-07T20:32:01.3486777Z self = 2025-05-07T20:32:01.3486946Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3487035Z 2025-05-07T20:32:01.3487109Z @given( 2025-05-07T20:32:01.3487226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3487322Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3487434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3487548Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3487656Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3487729Z ) 2025-05-07T20:32:01.3487968Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3488058Z def test_silu_mul_quant( 2025-05-07T20:32:01.3488135Z self, 2025-05-07T20:32:01.3488281Z T: int, 2025-05-07T20:32:01.3488356Z D: int, 2025-05-07T20:32:01.3488451Z scale_ub: Optional[float], 2025-05-07T20:32:01.3488537Z contiguous: bool, 2025-05-07T20:32:01.3488620Z compiled: bool, 2025-05-07T20:32:01.3488707Z ) -> None: 2025-05-07T20:32:01.3488795Z torch.manual_seed(2025) 2025-05-07T20:32:01.3488867Z 2025-05-07T20:32:01.3489027Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3490746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3490755Z 2025-05-07T20:32:01.3490868Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3490872Z 2025-05-07T20:32:01.3490975Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3491194Z self=, 2025-05-07T20:32:01.3491267Z T=4096, 2025-05-07T20:32:01.3491340Z D=7168, 2025-05-07T20:32:01.3491422Z scale_ub=1200.0, 2025-05-07T20:32:01.3491502Z contiguous=True, 2025-05-07T20:32:01.3491581Z compiled=False, 2025-05-07T20:32:01.3491655Z ) 2025-05-07T20:32:01.3491864Z self = 2025-05-07T20:32:01.3492032Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.3492036Z 2025-05-07T20:32:01.3492109Z @given( 2025-05-07T20:32:01.3492223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3492325Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3492434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3492544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3492660Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3492731Z ) 2025-05-07T20:32:01.3492970Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3493060Z def test_silu_mul_quant( 2025-05-07T20:32:01.3493133Z self, 2025-05-07T20:32:01.3493208Z T: int, 2025-05-07T20:32:01.3493281Z D: int, 2025-05-07T20:32:01.3493375Z scale_ub: Optional[float], 2025-05-07T20:32:01.3493462Z contiguous: bool, 2025-05-07T20:32:01.3493542Z compiled: bool, 2025-05-07T20:32:01.3493692Z ) -> None: 2025-05-07T20:32:01.3493787Z torch.manual_seed(2025) 2025-05-07T20:32:01.3493856Z 2025-05-07T20:32:01.3494024Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3495746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3495841Z 2025-05-07T20:32:01.3495953Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3495958Z 2025-05-07T20:32:01.3496058Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3496272Z self=, 2025-05-07T20:32:01.3496348Z T=16384, 2025-05-07T20:32:01.3496421Z D=7168, 2025-05-07T20:32:01.3496571Z scale_ub=None, 2025-05-07T20:32:01.3496659Z contiguous=False, 2025-05-07T20:32:01.3496738Z compiled=True, 2025-05-07T20:32:01.3496808Z ) 2025-05-07T20:32:01.3497025Z self = 2025-05-07T20:32:01.3497199Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.3497203Z 2025-05-07T20:32:01.3497275Z @given( 2025-05-07T20:32:01.3497394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3497489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3497604Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3497718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3497827Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3497900Z ) 2025-05-07T20:32:01.3498136Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3498546Z def test_silu_mul_quant( 2025-05-07T20:32:01.3498661Z self, 2025-05-07T20:32:01.3498741Z T: int, 2025-05-07T20:32:01.3498814Z D: int, 2025-05-07T20:32:01.3498911Z scale_ub: Optional[float], 2025-05-07T20:32:01.3499002Z contiguous: bool, 2025-05-07T20:32:01.3499082Z compiled: bool, 2025-05-07T20:32:01.3499161Z ) -> None: 2025-05-07T20:32:01.3499252Z torch.manual_seed(2025) 2025-05-07T20:32:01.3499325Z 2025-05-07T20:32:01.3499486Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3501212Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3501222Z 2025-05-07T20:32:01.3501333Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3501341Z 2025-05-07T20:32:01.3501439Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3501657Z self=, 2025-05-07T20:32:01.3501730Z T=4096, 2025-05-07T20:32:01.3501805Z D=7168, 2025-05-07T20:32:01.3501887Z scale_ub=None, 2025-05-07T20:32:01.3501970Z contiguous=True, 2025-05-07T20:32:01.3502049Z compiled=False, 2025-05-07T20:32:01.3502122Z ) 2025-05-07T20:32:01.3502331Z self = 2025-05-07T20:32:01.3502497Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3502501Z 2025-05-07T20:32:01.3502574Z @given( 2025-05-07T20:32:01.3502692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3502792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3502903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3503185Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3503298Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3503368Z ) 2025-05-07T20:32:01.3503614Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3503704Z def test_silu_mul_quant( 2025-05-07T20:32:01.3503778Z self, 2025-05-07T20:32:01.3503853Z T: int, 2025-05-07T20:32:01.3503928Z D: int, 2025-05-07T20:32:01.3504021Z scale_ub: Optional[float], 2025-05-07T20:32:01.3504113Z contiguous: bool, 2025-05-07T20:32:01.3504194Z compiled: bool, 2025-05-07T20:32:01.3504271Z ) -> None: 2025-05-07T20:32:01.3504365Z torch.manual_seed(2025) 2025-05-07T20:32:01.3504435Z 2025-05-07T20:32:01.3504741Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3506519Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3506531Z 2025-05-07T20:32:01.3506646Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3506650Z 2025-05-07T20:32:01.3506752Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3506968Z self=, 2025-05-07T20:32:01.3507049Z T=16384, 2025-05-07T20:32:01.3507124Z D=7168, 2025-05-07T20:32:01.3507202Z scale_ub=None, 2025-05-07T20:32:01.3507288Z contiguous=True, 2025-05-07T20:32:01.3507368Z compiled=False, 2025-05-07T20:32:01.3507443Z ) 2025-05-07T20:32:01.3507655Z self = 2025-05-07T20:32:01.3507821Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3507826Z 2025-05-07T20:32:01.3507899Z @given( 2025-05-07T20:32:01.3508017Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3508112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3508224Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3508336Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3508446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3508522Z ) 2025-05-07T20:32:01.3508765Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3508857Z def test_silu_mul_quant( 2025-05-07T20:32:01.3508933Z self, 2025-05-07T20:32:01.3509006Z T: int, 2025-05-07T20:32:01.3509080Z D: int, 2025-05-07T20:32:01.3509184Z scale_ub: Optional[float], 2025-05-07T20:32:01.3509270Z contiguous: bool, 2025-05-07T20:32:01.3509353Z compiled: bool, 2025-05-07T20:32:01.3509432Z ) -> None: 2025-05-07T20:32:01.3509523Z torch.manual_seed(2025) 2025-05-07T20:32:01.3509596Z 2025-05-07T20:32:01.3509755Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3511477Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
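The allocation sizes reported in these errors follow directly from the test shapes: x is [T, 2 * D] in bfloat16, i.e. 2 bytes per element. A quick check of the arithmetic reproduces the logged numbers exactly:

    # T=4096,  D=7168: 4096  * (2*7168) * 2 bytes = 112 MiB  ("Tried to allocate 112.00 MiB")
    # T=16384, D=7168: 16384 * (2*7168) * 2 bytes = 448 MiB  ("Tried to allocate 448.00 MiB")
    D = 7168
    for T in (4096, 16384):
        print(T, D, T * (2 * D) * 2 / 2**20, "MiB")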
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3511572Z 2025-05-07T20:32:01.3511686Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3511690Z 2025-05-07T20:32:01.3511790Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3512007Z self=, 2025-05-07T20:32:01.3512081Z T=16384, 2025-05-07T20:32:01.3512155Z D=7168, 2025-05-07T20:32:01.3512238Z scale_ub=1200.0, 2025-05-07T20:32:01.3512320Z contiguous=True, 2025-05-07T20:32:01.3512402Z compiled=False, 2025-05-07T20:32:01.3512478Z ) 2025-05-07T20:32:01.3512687Z self = 2025-05-07T20:32:01.3512859Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.3512942Z 2025-05-07T20:32:01.3513018Z @given( 2025-05-07T20:32:01.3513131Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3513229Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3513345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3513459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3513574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3513645Z ) 2025-05-07T20:32:01.3513889Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3513978Z def test_silu_mul_quant( 2025-05-07T20:32:01.3514052Z self, 2025-05-07T20:32:01.3514128Z T: int, 2025-05-07T20:32:01.3514201Z D: int, 2025-05-07T20:32:01.3514295Z scale_ub: Optional[float], 2025-05-07T20:32:01.3514383Z contiguous: bool, 2025-05-07T20:32:01.3514467Z compiled: bool, 2025-05-07T20:32:01.3514542Z ) -> None: 2025-05-07T20:32:01.3514641Z torch.manual_seed(2025) 2025-05-07T20:32:01.3514713Z 2025-05-07T20:32:01.3514875Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3516596Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
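For reference, the repeated "Trying example:" records come from Hypothesis running the @given strategies shown above at Verbosity.verbose; each record is one draw from the sampled_from strategies. A minimal self-contained sketch of the same pattern (note that _MAX_SAMPLES is a constant defined elsewhere in the test module; the value 20 below is an assumed placeholder, not the suite's real setting):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=20, deadline=None)
    def test_shapes(T: int, D: int) -> None:
        # Each invocation is printed as "Trying example: test_shapes(...)".
        assert T * D > 0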
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3516607Z 2025-05-07T20:32:01.3516721Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3516725Z 2025-05-07T20:32:01.3516827Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3517045Z self=, 2025-05-07T20:32:01.3517122Z T=128, 2025-05-07T20:32:01.3517195Z D=5120, 2025-05-07T20:32:01.3517275Z scale_ub=1200.0, 2025-05-07T20:32:01.3517360Z contiguous=False, 2025-05-07T20:32:01.3517446Z compiled=False, 2025-05-07T20:32:01.3517516Z ) 2025-05-07T20:32:01.3517728Z self = 2025-05-07T20:32:01.3517892Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3517897Z 2025-05-07T20:32:01.3517970Z @given( 2025-05-07T20:32:01.3518088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3518185Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3518298Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3518415Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3518522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3518601Z ) 2025-05-07T20:32:01.3518839Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3518927Z def test_silu_mul_quant( 2025-05-07T20:32:01.3519004Z self, 2025-05-07T20:32:01.3519163Z T: int, 2025-05-07T20:32:01.3519237Z D: int, 2025-05-07T20:32:01.3519335Z scale_ub: Optional[float], 2025-05-07T20:32:01.3519421Z contiguous: bool, 2025-05-07T20:32:01.3519503Z compiled: bool, 2025-05-07T20:32:01.3519580Z ) -> None: 2025-05-07T20:32:01.3519670Z torch.manual_seed(2025) 2025-05-07T20:32:01.3519746Z 2025-05-07T20:32:01.3519906Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3519976Z 2025-05-07T20:32:01.3520067Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3520188Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3520275Z x = x_sign * x_clamp 2025-05-07T20:32:01.3520356Z x0 = x[:, :D] 2025-05-07T20:32:01.3520510Z x1 = x[:, D:] 2025-05-07T20:32:01.3520581Z 2025-05-07T20:32:01.3520668Z if contiguous: 2025-05-07T20:32:01.3520756Z x0 = x0.contiguous() 2025-05-07T20:32:01.3520842Z x1 = x1.contiguous() 2025-05-07T20:32:01.3520921Z 2025-05-07T20:32:01.3521009Z if scale_ub is not None: 2025-05-07T20:32:01.3521114Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3521246Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3521319Z ) 2025-05-07T20:32:01.3521395Z else: 2025-05-07T20:32:01.3521487Z scale_ub_tensor = None 2025-05-07T20:32:01.3521556Z 2025-05-07T20:32:01.3521684Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3521772Z op = silu_mul_quant 2025-05-07T20:32:01.3521853Z if compiled: 2025-05-07T20:32:01.3521951Z op = torch.compile(op) 2025-05-07T20:32:01.3522058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3522127Z 2025-05-07T20:32:01.3522220Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3522224Z 2025-05-07T20:32:01.3522317Z moe/activation_test.py:117: 2025-05-07T20:32:01.3522454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3522551Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3522648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3523143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3523236Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3523595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3523817Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3524159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3524255Z kernel = self.compile( 2025-05-07T20:32:01.3524635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3524811Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3524942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3524947Z 2025-05-07T20:32:01.3525148Z self = 2025-05-07T20:32:01.3525920Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3526418Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397275bc0>} 2025-05-07T20:32:01.3527150Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3527425Z context = 2025-05-07T20:32:01.3527430Z 2025-05-07T20:32:01.3527595Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3527856Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3527964Z module_map=module_map) 2025-05-07T20:32:01.3528126Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3528228Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3528304Z E ^ 2025-05-07T20:32:01.3528731Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3528736Z 2025-05-07T20:32:01.3529146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3529156Z 2025-05-07T20:32:01.3529260Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3529484Z self=, 2025-05-07T20:32:01.3529557Z T=2048, 2025-05-07T20:32:01.3529631Z D=7168, 2025-05-07T20:32:01.3529713Z scale_ub=None, 2025-05-07T20:32:01.3529795Z contiguous=False, 2025-05-07T20:32:01.3529879Z compiled=False, 2025-05-07T20:32:01.3529948Z ) 2025-05-07T20:32:01.3530160Z self = 2025-05-07T20:32:01.3530331Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3530335Z 2025-05-07T20:32:01.3530408Z @given( 2025-05-07T20:32:01.3530528Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3530627Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3530739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3530852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3530970Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3531040Z ) 2025-05-07T20:32:01.3531282Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3531372Z def test_silu_mul_quant( 2025-05-07T20:32:01.3531445Z self, 2025-05-07T20:32:01.3531521Z T: int, 2025-05-07T20:32:01.3531595Z D: int, 2025-05-07T20:32:01.3531690Z scale_ub: Optional[float], 2025-05-07T20:32:01.3535065Z contiguous: bool, 2025-05-07T20:32:01.3535167Z compiled: bool, 2025-05-07T20:32:01.3535250Z ) -> None: 2025-05-07T20:32:01.3535348Z torch.manual_seed(2025) 2025-05-07T20:32:01.3535420Z 2025-05-07T20:32:01.3535593Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3537334Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
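The CompilationError above ("type fp8e4nv not supported in this architecture") is a separate failure mode from the OOMs: Triton's fp8e4nv (e4m3) dtype requires compute capability 8.9 or newer, while the A10G on this g5.4xlarge runner is SM 8.6, where only 'fp8e4b15' and 'fp8e5' are available. A hedged sketch of a capability guard that would skip these cases instead of failing (the guard and its message are an assumption for illustration, not the suite's actual gating):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv only on compute capability >= (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(
        not supports_fp8e4nv(),
        "fp8e4nv needs SM 8.9+ (Ada/Hopper); this GPU only supports fp8e4b15/fp8e5",
    )
    class Fp8ActivationTests(unittest.TestCase):
        ...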
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3537345Z 2025-05-07T20:32:01.3537462Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3537466Z 2025-05-07T20:32:01.3537567Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3537787Z self=, 2025-05-07T20:32:01.3537866Z T=128, 2025-05-07T20:32:01.3537945Z D=7168, 2025-05-07T20:32:01.3538025Z scale_ub=1200.0, 2025-05-07T20:32:01.3538109Z contiguous=True, 2025-05-07T20:32:01.3538188Z compiled=True, 2025-05-07T20:32:01.3538257Z ) 2025-05-07T20:32:01.3538583Z self = 2025-05-07T20:32:01.3538743Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.3538748Z 2025-05-07T20:32:01.3538824Z @given( 2025-05-07T20:32:01.3538945Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3539040Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3539154Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3539265Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3539372Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3539447Z ) 2025-05-07T20:32:01.3539686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3539877Z def test_silu_mul_quant( 2025-05-07T20:32:01.3539956Z self, 2025-05-07T20:32:01.3540031Z T: int, 2025-05-07T20:32:01.3540104Z D: int, 2025-05-07T20:32:01.3540202Z scale_ub: Optional[float], 2025-05-07T20:32:01.3540295Z contiguous: bool, 2025-05-07T20:32:01.3540377Z compiled: bool, 2025-05-07T20:32:01.3540455Z ) -> None: 2025-05-07T20:32:01.3540548Z torch.manual_seed(2025) 2025-05-07T20:32:01.3540620Z 2025-05-07T20:32:01.3540783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3540852Z 2025-05-07T20:32:01.3540943Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3541064Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3541151Z x = x_sign * x_clamp 2025-05-07T20:32:01.3541234Z x0 = x[:, :D] 2025-05-07T20:32:01.3541311Z x1 = x[:, D:] 2025-05-07T20:32:01.3541380Z 2025-05-07T20:32:01.3541468Z if contiguous: 2025-05-07T20:32:01.3541557Z x0 = x0.contiguous() 2025-05-07T20:32:01.3541643Z x1 = x1.contiguous() 2025-05-07T20:32:01.3541715Z 2025-05-07T20:32:01.3541801Z if scale_ub is not None: 2025-05-07T20:32:01.3541914Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3542044Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3542117Z ) 2025-05-07T20:32:01.3542194Z else: 2025-05-07T20:32:01.3542284Z scale_ub_tensor = None 2025-05-07T20:32:01.3542353Z 2025-05-07T20:32:01.3542481Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3542569Z op = silu_mul_quant 2025-05-07T20:32:01.3542651Z if compiled: 2025-05-07T20:32:01.3542750Z op = torch.compile(op) 2025-05-07T20:32:01.3542850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3542923Z 2025-05-07T20:32:01.3543014Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3543024Z 2025-05-07T20:32:01.3543118Z moe/activation_test.py:117: 2025-05-07T20:32:01.3543249Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3543346Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3543446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3543815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3543908Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3544392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3544494Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3544845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3545064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3545401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3545494Z kernel = self.compile( 2025-05-07T20:32:01.3545872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3546127Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3546251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3546258Z 2025-05-07T20:32:01.3546456Z self = 2025-05-07T20:32:01.3547224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3547792Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93971742c0>} 2025-05-07T20:32:01.3548524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3548716Z context = 2025-05-07T20:32:01.3548720Z 2025-05-07T20:32:01.3548881Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3549135Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3549247Z module_map=module_map) 2025-05-07T20:32:01.3549405Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3549505Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3549579Z E ^ 2025-05-07T20:32:01.3549931Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3549935Z 2025-05-07T20:32:01.3550340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3550350Z 2025-05-07T20:32:01.3550454Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3550671Z self=, 2025-05-07T20:32:01.3550745Z T=128, 2025-05-07T20:32:01.3550823Z D=7168, 2025-05-07T20:32:01.3550904Z scale_ub=1200.0, 2025-05-07T20:32:01.3550985Z contiguous=True, 2025-05-07T20:32:01.3551070Z compiled=False, 2025-05-07T20:32:01.3551140Z ) 2025-05-07T20:32:01.3551351Z self = 2025-05-07T20:32:01.3551515Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.3551519Z 2025-05-07T20:32:01.3551597Z @given( 2025-05-07T20:32:01.3551718Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3551815Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3551931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3552051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3552161Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3552232Z ) 2025-05-07T20:32:01.3552472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3552563Z def test_silu_mul_quant( 2025-05-07T20:32:01.3552640Z self, 2025-05-07T20:32:01.3552715Z T: int, 2025-05-07T20:32:01.3552789Z D: int, 2025-05-07T20:32:01.3552888Z scale_ub: Optional[float], 2025-05-07T20:32:01.3552973Z contiguous: bool, 2025-05-07T20:32:01.3553055Z compiled: bool, 2025-05-07T20:32:01.3553134Z ) -> None: 2025-05-07T20:32:01.3553225Z torch.manual_seed(2025) 2025-05-07T20:32:01.3553300Z 2025-05-07T20:32:01.3553467Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3553538Z 2025-05-07T20:32:01.3553626Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3553834Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3555557Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
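Note how the reported free memory shrinks as the run proceeds (30.44 MiB free in the earlier examples, 8.44 MiB free here): allocations from failed examples stay referenced while Hypothesis moves on, so progressively smaller requests also fail. One mitigation sketch, assuming a unittest.TestCase like the one in the traceback; this only returns reserved-but-unallocated blocks between tests and does not fix the underlying memory pressure:

    import gc
    import unittest
    import torch

    class ActivationTestsBase(unittest.TestCase):
        def tearDown(self) -> None:
            # Drop Python references left over from a failed example, then
            # return cached CUDA blocks to the driver before the next test.
            gc.collect()
            torch.cuda.empty_cache()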
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3555563Z 2025-05-07T20:32:01.3555679Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.3555758Z 2025-05-07T20:32:01.3555859Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3556078Z self=, 2025-05-07T20:32:01.3556154Z T=128, 2025-05-07T20:32:01.3556234Z D=5120, 2025-05-07T20:32:01.3556315Z scale_ub=1200.0, 2025-05-07T20:32:01.3556398Z contiguous=True, 2025-05-07T20:32:01.3556478Z compiled=True, 2025-05-07T20:32:01.3556550Z ) 2025-05-07T20:32:01.3556761Z self = 2025-05-07T20:32:01.3556921Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.3556926Z 2025-05-07T20:32:01.3557002Z @given( 2025-05-07T20:32:01.3557117Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3557215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3557324Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3557440Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3557551Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3557623Z ) 2025-05-07T20:32:01.3557862Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3557960Z def test_silu_mul_quant( 2025-05-07T20:32:01.3558034Z self, 2025-05-07T20:32:01.3558107Z T: int, 2025-05-07T20:32:01.3558185Z D: int, 2025-05-07T20:32:01.3558279Z scale_ub: Optional[float], 2025-05-07T20:32:01.3558363Z contiguous: bool, 2025-05-07T20:32:01.3558454Z compiled: bool, 2025-05-07T20:32:01.3558529Z ) -> None: 2025-05-07T20:32:01.3558622Z torch.manual_seed(2025) 2025-05-07T20:32:01.3558691Z 2025-05-07T20:32:01.3558850Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3558925Z 2025-05-07T20:32:01.3559013Z > x_sign = torch.sign(x) 2025-05-07T20:32:01.3560730Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3560743Z 2025-05-07T20:32:01.3560856Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:01.3560861Z 2025-05-07T20:32:01.3560957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3561175Z self=, 2025-05-07T20:32:01.3561250Z T=128, 2025-05-07T20:32:01.3561323Z D=7168, 2025-05-07T20:32:01.3561403Z scale_ub=None, 2025-05-07T20:32:01.3561484Z contiguous=True, 2025-05-07T20:32:01.3561571Z compiled=True, 2025-05-07T20:32:01.3561642Z ) 2025-05-07T20:32:01.3561853Z self = 2025-05-07T20:32:01.3562014Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.3562102Z 2025-05-07T20:32:01.3562175Z @given( 2025-05-07T20:32:01.3562288Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3562387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3562498Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3562610Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3562721Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3562792Z ) 2025-05-07T20:32:01.3563034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3563125Z def test_silu_mul_quant( 2025-05-07T20:32:01.3563198Z self, 2025-05-07T20:32:01.3563349Z T: int, 2025-05-07T20:32:01.3563425Z D: int, 2025-05-07T20:32:01.3563519Z scale_ub: Optional[float], 2025-05-07T20:32:01.3563607Z contiguous: bool, 2025-05-07T20:32:01.3563690Z compiled: bool, 2025-05-07T20:32:01.3563771Z ) -> None: 2025-05-07T20:32:01.3563864Z torch.manual_seed(2025) 2025-05-07T20:32:01.3563934Z 2025-05-07T20:32:01.3564094Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3565814Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3565820Z 2025-05-07T20:32:01.3565931Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3566064Z =============================== warnings summary =============================== 2025-05-07T20:32:01.3566369Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:01.3566667Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:01.3566959Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:01.3567813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:01.3568048Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:01.3568052Z 2025-05-07T20:32:01.3568224Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:32:01.3569460Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:32:01.3569646Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:32:01.3569651Z 2025-05-07T20:32:01.3569857Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:01.3570014Z ================== 1 failed, 1 passed, 13 warnings in 18.91s =================== 2025-05-07T20:32:03.1473117Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:03.2098401Z 2025-05-07T20:32:03.2098828Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:32:03.2099753Z 2025-05-07T20:32:03.2099757Z 2025-05-07T20:32:03.2120424Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:05.3948036Z ============================= test session starts ============================== 2025-05-07T20:32:05.3948850Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:05.3949376Z cachedir: .pytest_cache 2025-05-07T20:32:05.3949941Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:05.3950980Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:05.3951383Z plugins: hypothesis-6.131.14 2025-05-07T20:32:06.9480601Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:07.0450324Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:07.0450868Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:07.0451164Z 2025-05-07T20:32:08.9136754Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.9137828Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:08.9139179Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.9140590Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.9141568Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.9142846Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.9144202Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.9145484Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.9146831Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.9147864Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] module_map=module_map) 2025-05-07T20:32:08.9149104Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.9150335Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:08.9151170Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:08.9152352Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.9153896Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:08.9154921Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:08.9155925Z W0507 
20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:08.9157270Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.9158529Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.9159421Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:08.9160492Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:08.9161518Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:08.9162282Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:08.9163438Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.9164775Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.9165825Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.9166733Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.9167467Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:08.9168472Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.9308021Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.9309058Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:08.9310369Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.9311779Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.9312741Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.9314269Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.9315626Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.9317032Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.9318378Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.9319405Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] module_map=module_map) 2025-05-07T20:32:08.9320642Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.9321864Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:08.9322698Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:08.9323887Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.9325098Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:08.9326107Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:08.9327108Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return 
visitor(node) 2025-05-07T20:32:08.9328303Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.9329562Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.9330460Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:08.9331521Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:08.9332545Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:08.9333305Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:08.9334559Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.9335882Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.9337021Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.9337918Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.9338653Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:08.9339735Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.3313459Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.3314381Z self=, 2025-05-07T20:32:09.3314865Z T=1, 2025-05-07T20:32:09.3315063Z D=5120, 2025-05-07T20:32:09.3315267Z scale_ub=None, 2025-05-07T20:32:09.3315481Z contiguous=True, 2025-05-07T20:32:09.3315711Z compiled=True, 2025-05-07T20:32:09.3315930Z ) 2025-05-07T20:32:09.3316251Z self = 2025-05-07T20:32:09.3316742Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:09.3317001Z 2025-05-07T20:32:09.3317096Z @given( 2025-05-07T20:32:09.3317329Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.3317674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.3318023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.3318359Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.3318683Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.3318973Z ) 2025-05-07T20:32:09.3319331Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.3319771Z def test_silu_mul_quant( 2025-05-07T20:32:09.3320018Z self, 2025-05-07T20:32:09.3320218Z T: int, 2025-05-07T20:32:09.3320413Z D: int, 2025-05-07T20:32:09.3320639Z scale_ub: Optional[float], 2025-05-07T20:32:09.3320914Z contiguous: bool, 2025-05-07T20:32:09.3321151Z compiled: bool, 2025-05-07T20:32:09.3321387Z ) -> None: 2025-05-07T20:32:09.3321607Z torch.manual_seed(2025) 2025-05-07T20:32:09.3321844Z 2025-05-07T20:32:09.3322123Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.3322474Z 2025-05-07T20:32:09.3322677Z x_sign = torch.sign(x) 2025-05-07T20:32:09.3322966Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.3323280Z x = x_sign * x_clamp 2025-05-07T20:32:09.3323526Z x0 = x[:, :D] 2025-05-07T20:32:09.3323748Z x1 = x[:, D:] 2025-05-07T20:32:09.3323957Z 2025-05-07T20:32:09.3324148Z if contiguous: 2025-05-07T20:32:09.3324381Z x0 = x0.contiguous() 2025-05-07T20:32:09.3324644Z x1 = x1.contiguous() 2025-05-07T20:32:09.3324886Z 2025-05-07T20:32:09.3325077Z if scale_ub is not None: 2025-05-07T20:32:09.3325352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.3325687Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.3325989Z ) 2025-05-07T20:32:09.3326185Z else: 2025-05-07T20:32:09.3326398Z scale_ub_tensor = None 2025-05-07T20:32:09.3326650Z 2025-05-07T20:32:09.3326884Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3327209Z op = silu_mul_quant 2025-05-07T20:32:09.3327467Z if compiled: 2025-05-07T20:32:09.3327711Z op = torch.compile(op) 2025-05-07T20:32:09.3328009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3328660Z 2025-05-07T20:32:09.3328850Z y_fp8, y_scale = fn() 2025-05-07T20:32:09.3329136Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:09.3329434Z 2025-05-07T20:32:09.3329666Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3330002Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:09.3330303Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:09.3330611Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:09.3330973Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.3331288Z 2025-05-07T20:32:09.3331486Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:09.3331688Z 2025-05-07T20:32:09.3331995Z moe/activation_test.py:126: 2025-05-07T20:32:09.3332298Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3332635Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:09.3332963Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.3333897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:09.3334642Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:09.3335186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.3335860Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.3336547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:09.3337271Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.3337982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:09.3338616Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:09.3339228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:09.3339742Z fn() 2025-05-07T20:32:09.3340242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:09.3340825Z self.fn.run( 2025-05-07T20:32:09.3341293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.3341811Z kernel = self.compile( 2025-05-07T20:32:09.3342351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.3343009Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3343407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3343631Z 2025-05-07T20:32:09.3343837Z self = 2025-05-07T20:32:09.3344914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.3346292Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce612f36a0>} 2025-05-07T20:32:09.3347624Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.3348645Z context = 2025-05-07T20:32:09.3348930Z 2025-05-07T20:32:09.3349099Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.3349709Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3350177Z module_map=module_map) 2025-05-07T20:32:09.3350542Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3350902Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:09.3351175Z E ^ 2025-05-07T20:32:09.3351640Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.3352079Z 2025-05-07T20:32:09.3352490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.3353002Z 2025-05-07T20:32:09.3353195Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.3353613Z self=, 2025-05-07T20:32:09.3354021Z T=2048, 2025-05-07T20:32:09.3354212Z D=5120, 2025-05-07T20:32:09.3354420Z scale_ub=1200.0, 2025-05-07T20:32:09.3354646Z contiguous=True, 2025-05-07T20:32:09.3354867Z compiled=False, 2025-05-07T20:32:09.3355079Z ) 2025-05-07T20:32:09.3355401Z self = 2025-05-07T20:32:09.3355885Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:09.3356163Z 2025-05-07T20:32:09.3356245Z @given( 2025-05-07T20:32:09.3356476Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.3356787Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.3357098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.3357431Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.3357808Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.3358096Z ) 2025-05-07T20:32:09.3358448Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.3358889Z def test_silu_mul_quant( 2025-05-07T20:32:09.3359131Z self, 2025-05-07T20:32:09.3359330Z T: int, 2025-05-07T20:32:09.3359527Z D: int, 2025-05-07T20:32:09.3359743Z scale_ub: Optional[float], 2025-05-07T20:32:09.3360018Z contiguous: bool, 2025-05-07T20:32:09.3360261Z compiled: bool, 2025-05-07T20:32:09.3360479Z ) -> None: 2025-05-07T20:32:09.3360697Z torch.manual_seed(2025) 2025-05-07T20:32:09.3360940Z 2025-05-07T20:32:09.3361206Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.3361551Z 2025-05-07T20:32:09.3361752Z x_sign = torch.sign(x) 2025-05-07T20:32:09.3362037Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.3362359Z x = x_sign * x_clamp 2025-05-07T20:32:09.3362601Z x0 = x[:, :D] 2025-05-07T20:32:09.3362815Z x1 = x[:, D:] 2025-05-07T20:32:09.3363023Z 2025-05-07T20:32:09.3363211Z if contiguous: 2025-05-07T20:32:09.3363445Z x0 = x0.contiguous() 2025-05-07T20:32:09.3363703Z x1 = x1.contiguous() 2025-05-07T20:32:09.3363946Z 2025-05-07T20:32:09.3364140Z if scale_ub is not None: 2025-05-07T20:32:09.3364408Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.3364743Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.3365054Z ) 2025-05-07T20:32:09.3365246Z else: 2025-05-07T20:32:09.3365466Z scale_ub_tensor = None 2025-05-07T20:32:09.3365725Z 2025-05-07T20:32:09.3365952Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3366272Z op = silu_mul_quant 2025-05-07T20:32:09.3366526Z if compiled: 2025-05-07T20:32:09.3366775Z op = torch.compile(op) 2025-05-07T20:32:09.3367079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3367358Z 2025-05-07T20:32:09.3367550Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.3367722Z 2025-05-07T20:32:09.3367914Z moe/activation_test.py:117: 2025-05-07T20:32:09.3368210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3368544Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.3368822Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3369507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.3370193Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.3370722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.3371399Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.3372175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.3372707Z kernel = self.compile( 2025-05-07T20:32:09.3373238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.3374025Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3374426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3374651Z 2025-05-07T20:32:09.3374867Z self = 2025-05-07T20:32:09.3375922Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.3377277Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60f61f80>} 2025-05-07T20:32:09.3378658Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.3386152Z context = 2025-05-07T20:32:09.3386451Z 2025-05-07T20:32:09.3386629Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.3387149Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3387625Z module_map=module_map) 2025-05-07T20:32:09.3388002Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3388355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.3388627Z E ^ 2025-05-07T20:32:09.3389108Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.3389983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.7259612Z W0507 20:32:09.721000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... the same fp8e4nv CompilationError traceback as above is logged twice under [0/1]; duplicate tracebacks elided ...]
2025-05-07T20:32:10.3822735Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:10.3823445Z self=,
2025-05-07T20:32:10.3824004Z T=2048,
2025-05-07T20:32:10.3824299Z D=5120,
2025-05-07T20:32:10.3824503Z scale_ub=1200.0,
2025-05-07T20:32:10.3824736Z contiguous=True,
2025-05-07T20:32:10.3824957Z compiled=True,
2025-05-07T20:32:10.3825166Z )
[... test source listing elided; identical to the listing printed for the next example below ...]
2025-05-07T20:32:10.3840729Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:10.3841020Z moe/activation_test.py:126:
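[Note: every failure in this job has the same root cause. Triton's fp8e4nv corresponds to torch.float8_e4m3fn, and Triton only lowers it on GPUs of compute capability sm_89 or newer (Ada/Hopper); this runner is a g5.4xlarge, whose NVIDIA A10G is sm_86, hence "type fp8e4nv not supported in this architecture". A minimal capability guard along the following lines (a sketch; cuda_supports_fp8e4nv is a hypothetical helper, not part of moe/activation_test.py) would let such tests skip instead of erroring on pre-sm_89 runners:

    import unittest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) lowering in Triton requires sm_89+.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)  # A10G reports (8, 6)

    # usage sketch:
    # @unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(self, ...): ...
]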
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.3841654Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:10.3841975Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:10.3842758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:10.3843505Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:10.3844053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.3844722Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.3845398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:10.3846112Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:10.3846823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:10.3847450Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:10.3848046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:10.3848646Z fn() 2025-05-07T20:32:10.3849138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:10.3849716Z self.fn.run( 2025-05-07T20:32:10.3850178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.3850699Z kernel = self.compile( 2025-05-07T20:32:10.3851225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.3851867Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.3852265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.3852495Z 2025-05-07T20:32:10.3852776Z self = 2025-05-07T20:32:10.3854002Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.3855380Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60bf07c0>} 2025-05-07T20:32:10.3856700Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.3857709Z context = 2025-05-07T20:32:10.3857992Z 2025-05-07T20:32:10.3858165Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.3858677Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.3859140Z module_map=module_map) 2025-05-07T20:32:10.3859507Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.3859860Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:10.3860128Z E ^ 2025-05-07T20:32:10.3860584Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.3861021Z 2025-05-07T20:32:10.3861427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.3861936Z 2025-05-07T20:32:10.3862042Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.3862452Z self=, 2025-05-07T20:32:10.3862849Z T=16384, 2025-05-07T20:32:10.3863043Z D=7168, 2025-05-07T20:32:10.3863236Z scale_ub=1200.0, 2025-05-07T20:32:10.3863458Z contiguous=False, 2025-05-07T20:32:10.3863681Z compiled=False, 2025-05-07T20:32:10.3863887Z ) 2025-05-07T20:32:10.3864203Z self = 2025-05-07T20:32:10.3864693Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:10.3864969Z 2025-05-07T20:32:10.3865065Z @given( 2025-05-07T20:32:10.3865294Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.3865609Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.3865913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.3866236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.3866562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.3866846Z ) 2025-05-07T20:32:10.3867194Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.3867634Z def test_silu_mul_quant( 2025-05-07T20:32:10.3867877Z self, 2025-05-07T20:32:10.3868099Z T: int, 2025-05-07T20:32:10.3868320Z D: int, 2025-05-07T20:32:10.3868540Z scale_ub: Optional[float], 2025-05-07T20:32:10.3869093Z contiguous: bool, 2025-05-07T20:32:10.3869323Z compiled: bool, 2025-05-07T20:32:10.3869547Z ) -> None: 2025-05-07T20:32:10.3869761Z torch.manual_seed(2025) 2025-05-07T20:32:10.3869998Z 2025-05-07T20:32:10.3870268Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.3870605Z 2025-05-07T20:32:10.3870792Z x_sign = torch.sign(x) 2025-05-07T20:32:10.3871078Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.3871383Z x = x_sign * x_clamp 2025-05-07T20:32:10.3871613Z x0 = x[:, :D] 2025-05-07T20:32:10.3871829Z x1 = x[:, D:] 2025-05-07T20:32:10.3872035Z 2025-05-07T20:32:10.3872212Z if contiguous: 2025-05-07T20:32:10.3872523Z x0 = x0.contiguous() 2025-05-07T20:32:10.3872778Z x1 = x1.contiguous() 2025-05-07T20:32:10.3873014Z 2025-05-07T20:32:10.3873199Z if scale_ub is not None: 2025-05-07T20:32:10.3873478Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.3873814Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.3874115Z ) 2025-05-07T20:32:10.3874307Z else: 2025-05-07T20:32:10.3874516Z scale_ub_tensor = None 2025-05-07T20:32:10.3874757Z 2025-05-07T20:32:10.3874984Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.3875298Z op = silu_mul_quant 2025-05-07T20:32:10.3875539Z if compiled: 2025-05-07T20:32:10.3875790Z op = torch.compile(op) 2025-05-07T20:32:10.3876087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.3876355Z 2025-05-07T20:32:10.3876549Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.3876716Z 2025-05-07T20:32:10.3876821Z moe/activation_test.py:117: 2025-05-07T20:32:10.3877114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.3877439Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.3877725Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.3878404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:10.3879079Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.3879611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.3880286Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.3880942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.3881466Z kernel = self.compile( 2025-05-07T20:32:10.3882009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.3882660Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.3883052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.3883286Z 2025-05-07T20:32:10.3883492Z self = 2025-05-07T20:32:10.3884555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.3885902Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60bd5440>} 2025-05-07T20:32:10.3887229Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.3888231Z context = 2025-05-07T20:32:10.3888599Z 2025-05-07T20:32:10.3888763Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.3889275Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.3889741Z module_map=module_map) 2025-05-07T20:32:10.3890097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.3890448Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.3890707Z E ^ 2025-05-07T20:32:10.3891151Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.3892099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.6106404Z W0507 20:32:10.606000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... the same fp8e4nv CompilationError traceback is logged twice under [0/2]; duplicate tracebacks elided ...]
2025-05-07T20:32:11.1147774Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.1148306Z self=,
2025-05-07T20:32:11.1148716Z T=1,
2025-05-07T20:32:11.1148939Z D=7168,
2025-05-07T20:32:11.1149509Z scale_ub=None,
2025-05-07T20:32:11.1149730Z contiguous=True,
2025-05-07T20:32:11.1149961Z compiled=True,
2025-05-07T20:32:11.1150171Z )
[... identical test source listing elided; ref_fn() fails in triton_quantize_fp8_row -> _kernel_quantize_fp8_row with the same fp8e4nv CompilationError ...]
2025-05-07T20:32:11.1193017Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.1193424Z self=,
2025-05-07T20:32:11.1193814Z T=4096,
2025-05-07T20:32:11.1194002Z D=5120,
2025-05-07T20:32:11.1194203Z scale_ub=None,
2025-05-07T20:32:11.1194419Z contiguous=False,
2025-05-07T20:32:11.1194646Z compiled=False,
2025-05-07T20:32:11.1194854Z )
[... identical test source listing elided; fn() fails in silu_mul_quant -> _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError ...]
2025-05-07T20:32:11.3998778Z W0507 20:32:11.396000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... the same fp8e4nv CompilationError traceback is logged twice under [0/3]; duplicate tracebacks elided ...]
2025-05-07T20:32:12.1094086Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.1094737Z self=,
2025-05-07T20:32:12.1095226Z T=4096,
2025-05-07T20:32:12.1095422Z D=7168,
2025-05-07T20:32:12.1095615Z scale_ub=None,
2025-05-07T20:32:12.1095827Z contiguous=False,
2025-05-07T20:32:12.1096052Z compiled=False,
2025-05-07T20:32:12.1096266Z )
[... identical test source listing elided; fn() fails in silu_mul_quant -> _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError ...]
2025-05-07T20:32:12.1125682Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.1126180Z self=,
2025-05-07T20:32:12.1126574Z T=128,
2025-05-07T20:32:12.1126760Z D=7168,
2025-05-07T20:32:12.1126958Z scale_ub=None,
2025-05-07T20:32:12.1127178Z contiguous=False,
2025-05-07T20:32:12.1127401Z compiled=True,
2025-05-07T20:32:12.1127616Z )
[... identical test source listing elided; ref_fn() fails in triton_quantize_fp8_row -> _kernel_quantize_fp8_row; the log is truncated mid-traceback at self.fn.run( ...]
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.1152670Z kernel = self.compile( 2025-05-07T20:32:12.1153195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.1153837Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.1154231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1154454Z 2025-05-07T20:32:12.1154662Z self = 2025-05-07T20:32:12.1155722Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.1157077Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5ba7a020>} 2025-05-07T20:32:12.1158402Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.1159405Z context = 2025-05-07T20:32:12.1159692Z 2025-05-07T20:32:12.1159853Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.1160374Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.1160837Z module_map=module_map) 2025-05-07T20:32:12.1161202Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.1161642Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.1161910Z E ^ 2025-05-07T20:32:12.1162380Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.1169215Z 2025-05-07T20:32:12.1169662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.3544616Z 2025-05-07T20:32:12.3545289Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.3545975Z self=, 2025-05-07T20:32:12.3546536Z T=128, 2025-05-07T20:32:12.3546741Z D=7168, 2025-05-07T20:32:12.3546947Z scale_ub=None, 2025-05-07T20:32:12.3547557Z contiguous=False, 2025-05-07T20:32:12.3547795Z compiled=False, 2025-05-07T20:32:12.3548019Z ) 2025-05-07T20:32:12.3548375Z self = 2025-05-07T20:32:12.3548905Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:12.3549182Z 2025-05-07T20:32:12.3549265Z @given( 2025-05-07T20:32:12.3549509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3549836Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3550145Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3550484Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3550824Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3551115Z ) 2025-05-07T20:32:12.3551473Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3551924Z def test_silu_mul_quant( 2025-05-07T20:32:12.3552174Z self, 2025-05-07T20:32:12.3552380Z T: int, 2025-05-07T20:32:12.3552593Z D: int, 2025-05-07T20:32:12.3552814Z scale_ub: Optional[float], 2025-05-07T20:32:12.3553099Z contiguous: bool, 2025-05-07T20:32:12.3553353Z compiled: bool, 2025-05-07T20:32:12.3553585Z ) -> None: 2025-05-07T20:32:12.3553814Z torch.manual_seed(2025) 2025-05-07T20:32:12.3554068Z 2025-05-07T20:32:12.3554352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3554695Z 2025-05-07T20:32:12.3554900Z x_sign = torch.sign(x) 
2025-05-07T20:32:12.3555200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3555510Z x = x_sign * x_clamp 2025-05-07T20:32:12.3555761Z x0 = x[:, :D] 2025-05-07T20:32:12.3555985Z x1 = x[:, D:] 2025-05-07T20:32:12.3556195Z 2025-05-07T20:32:12.3556392Z if contiguous: 2025-05-07T20:32:12.3556632Z x0 = x0.contiguous() 2025-05-07T20:32:12.3556897Z x1 = x1.contiguous() 2025-05-07T20:32:12.3557147Z 2025-05-07T20:32:12.3557344Z if scale_ub is not None: 2025-05-07T20:32:12.3557632Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.3557977Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.3558285Z ) 2025-05-07T20:32:12.3558519Z else: 2025-05-07T20:32:12.3558758Z scale_ub_tensor = None 2025-05-07T20:32:12.3559011Z 2025-05-07T20:32:12.3559253Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3559577Z op = silu_mul_quant 2025-05-07T20:32:12.3559828Z if compiled: 2025-05-07T20:32:12.3560087Z op = torch.compile(op) 2025-05-07T20:32:12.3560393Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3560672Z 2025-05-07T20:32:12.3560874Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.3561047Z 2025-05-07T20:32:12.3561154Z moe/activation_test.py:117: 2025-05-07T20:32:12.3561463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3561797Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.3562085Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3562943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.3563623Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.3564162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.3564840Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3565502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.3566034Z kernel = self.compile( 2025-05-07T20:32:12.3566661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.3567320Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3567716Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3567957Z 2025-05-07T20:32:12.3568164Z self = 2025-05-07T20:32:12.3569281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.3570649Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fce5b278680>} 2025-05-07T20:32:12.3571988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.3572993Z context = 2025-05-07T20:32:12.3573286Z 2025-05-07T20:32:12.3573450Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.3574083Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3574553Z module_map=module_map) 2025-05-07T20:32:12.3574914Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3575269Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.3575538Z E ^ 2025-05-07T20:32:12.3575994Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.3576442Z 2025-05-07T20:32:12.3576854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.3577373Z 2025-05-07T20:32:12.3577479Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.3577893Z self=, 2025-05-07T20:32:12.3578298Z T=4096, 2025-05-07T20:32:12.3578493Z D=5120, 2025-05-07T20:32:12.3578691Z scale_ub=1200.0, 2025-05-07T20:32:12.3578915Z contiguous=True, 2025-05-07T20:32:12.3579143Z compiled=False, 2025-05-07T20:32:12.3579356Z ) 2025-05-07T20:32:12.3579673Z self = 2025-05-07T20:32:12.3580170Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.3580451Z 2025-05-07T20:32:12.3580534Z @given( 2025-05-07T20:32:12.3580768Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3581080Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3581393Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3581734Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3582062Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3582352Z ) 2025-05-07T20:32:12.3582707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3583240Z def test_silu_mul_quant( 2025-05-07T20:32:12.3583485Z self, 2025-05-07T20:32:12.3583688Z T: int, 2025-05-07T20:32:12.3583886Z D: int, 2025-05-07T20:32:12.3584111Z scale_ub: Optional[float], 2025-05-07T20:32:12.3584389Z contiguous: bool, 2025-05-07T20:32:12.3584626Z compiled: bool, 2025-05-07T20:32:12.3584853Z ) -> None: 2025-05-07T20:32:12.3585069Z torch.manual_seed(2025) 2025-05-07T20:32:12.3585313Z 2025-05-07T20:32:12.3585584Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3585930Z 2025-05-07T20:32:12.3586133Z x_sign = torch.sign(x) 2025-05-07T20:32:12.3586569Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3586883Z x = x_sign * x_clamp 2025-05-07T20:32:12.3587126Z x0 = x[:, :D] 2025-05-07T20:32:12.3587339Z x1 = x[:, D:] 2025-05-07T20:32:12.3587553Z 2025-05-07T20:32:12.3587750Z if contiguous: 2025-05-07T20:32:12.3587978Z x0 = x0.contiguous() 2025-05-07T20:32:12.3588242Z x1 = x1.contiguous() 2025-05-07T20:32:12.3588492Z 2025-05-07T20:32:12.3588683Z if scale_ub is not None: 2025-05-07T20:32:12.3588961Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.3589295Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.3589605Z ) 2025-05-07T20:32:12.3589803Z else: 2025-05-07T20:32:12.3590016Z scale_ub_tensor = None 2025-05-07T20:32:12.3590272Z 2025-05-07T20:32:12.3590501Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3590826Z op = silu_mul_quant 2025-05-07T20:32:12.3591085Z if compiled: 
2025-05-07T20:32:12.3591330Z op = torch.compile(op) 2025-05-07T20:32:12.3591633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3591913Z 2025-05-07T20:32:12.3592105Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.3592285Z 2025-05-07T20:32:12.3592385Z moe/activation_test.py:117: 2025-05-07T20:32:12.3592686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3593018Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.3593309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3594002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.3594695Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.3595230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.3595922Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3596588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.3597119Z kernel = self.compile( 2025-05-07T20:32:12.3597667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.3598622Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3599030Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3599258Z 2025-05-07T20:32:12.3599466Z self = 2025-05-07T20:32:12.3600537Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.3601907Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5b278f40>} 2025-05-07T20:32:12.3603243Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.3604387Z context = 2025-05-07T20:32:12.3604670Z 2025-05-07T20:32:12.3604834Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.3605353Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3605820Z module_map=module_map) 2025-05-07T20:32:12.3606184Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3606541Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.3606932Z E ^ 2025-05-07T20:32:12.3607395Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.3607837Z 2025-05-07T20:32:12.3608251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.3608821Z 2025-05-07T20:32:12.3608925Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.3609343Z self=, 2025-05-07T20:32:12.3609747Z T=1, 2025-05-07T20:32:12.3609931Z D=5120, 2025-05-07T20:32:12.3610132Z scale_ub=None, 2025-05-07T20:32:12.3610351Z contiguous=True, 2025-05-07T20:32:12.3610572Z compiled=True, 2025-05-07T20:32:12.3610783Z ) 2025-05-07T20:32:12.3611110Z self = 2025-05-07T20:32:12.3611587Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.3611855Z 2025-05-07T20:32:12.3611936Z @given( 2025-05-07T20:32:12.3612174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3612485Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3612800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3613134Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3613462Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3613854Z ) 2025-05-07T20:32:12.3614210Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3614658Z def test_silu_mul_quant( 2025-05-07T20:32:12.3614900Z self, 2025-05-07T20:32:12.3615104Z T: int, 2025-05-07T20:32:12.3615307Z D: int, 2025-05-07T20:32:12.3615524Z scale_ub: Optional[float], 2025-05-07T20:32:12.3615800Z contiguous: bool, 2025-05-07T20:32:12.3616055Z compiled: bool, 2025-05-07T20:32:12.3616283Z ) -> None: 2025-05-07T20:32:12.3616508Z torch.manual_seed(2025) 2025-05-07T20:32:12.3616748Z 2025-05-07T20:32:12.3617030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3617383Z 2025-05-07T20:32:12.3617581Z x_sign = torch.sign(x) 2025-05-07T20:32:12.3617877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3618191Z x = x_sign * x_clamp 2025-05-07T20:32:12.3618464Z x0 = x[:, :D] 2025-05-07T20:32:12.3618705Z x1 = x[:, D:] 2025-05-07T20:32:12.3618921Z 2025-05-07T20:32:12.3619104Z if contiguous: 2025-05-07T20:32:12.3619343Z x0 = x0.contiguous() 2025-05-07T20:32:12.3619605Z x1 = x1.contiguous() 2025-05-07T20:32:12.3619844Z 2025-05-07T20:32:12.3620047Z if scale_ub is not None: 2025-05-07T20:32:12.3620325Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.3620659Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.3620974Z ) 2025-05-07T20:32:12.3621174Z else: 2025-05-07T20:32:12.3621390Z scale_ub_tensor = None 2025-05-07T20:32:12.3621639Z 2025-05-07T20:32:12.3621877Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3622300Z op = silu_mul_quant 2025-05-07T20:32:12.3622548Z if compiled: 2025-05-07T20:32:12.3622804Z op = torch.compile(op) 2025-05-07T20:32:12.3623106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3623381Z 2025-05-07T20:32:12.3623586Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.3623873Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.3624163Z 2025-05-07T20:32:12.3624407Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3624749Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.3625041Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.3625357Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.3625799Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.3626119Z 2025-05-07T20:32:12.3626320Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:12.3626522Z 2025-05-07T20:32:12.3626630Z moe/activation_test.py:126: 2025-05-07T20:32:12.3626931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3627266Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.3627599Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.3628383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.3629132Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.3629674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.3630360Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3631052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.3631768Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.3632497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.3633137Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.3633744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.3634258Z fn() 2025-05-07T20:32:12.3634766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.3635345Z self.fn.run( 2025-05-07T20:32:12.3635812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.3636340Z kernel = self.compile( 2025-05-07T20:32:12.3636882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.3637534Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3637932Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3638170Z 2025-05-07T20:32:12.3638383Z self = 2025-05-07T20:32:12.3639502Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.3640864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60117f60>} 2025-05-07T20:32:12.3642195Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.3643303Z context = 2025-05-07T20:32:12.3643593Z 2025-05-07T20:32:12.3643760Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.3644279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3644743Z module_map=module_map) 2025-05-07T20:32:12.3645110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3645472Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.3645740Z E ^ 2025-05-07T20:32:12.3646278Z E ValueError("type fp8e4nv not supported in this architecture. 
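The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Every failure above has the same root cause: Triton only lowers the fp8e4nv (torch.float8_e4m3fn) dtype on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); on older architectures only fp8e5 and fp8e4b15 are available, which is exactly what the ValueError reports. A minimal sketch, assuming a hypothetical guard that is not part of the test file, of how these examples could be skipped on unsupported hardware:

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv (torch.float8_e4m3fn) only on compute
    # capability >= 8.9 (Ada, Hopper); older GPUs raise the ValueError
    # seen throughout this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class Fp8GuardedTest(unittest.TestCase):  # hypothetical class name
    def setUp(self) -> None:
        if not supports_fp8e4nv():
            self.skipTest("fp8e4nv not supported on this GPU architecture")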
2025-05-07T20:32:12.3646725Z 2025-05-07T20:32:12.3647136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.5812496Z W0507 20:32:12.577000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:32:12.577000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] [traceback identical to the [0/3] warning above, ending in:]
W0507 20:32:12.577000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:32:12.577000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant(
W0507 20:32:12.577000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^
W0507 20:32:12.577000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture.
2025-05-07T20:32:12.6437461Z W0507 20:32:12.640000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:32:12.640000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] [traceback identical to the previous warning, ending in:]
W0507 20:32:12.640000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture.
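The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

The full test source is reprinted for every attempt because the suite runs with @settings(verbosity=Verbosity.verbose, ...): at that verbosity Hypothesis echoes each example it draws from the sampled_from strategies as a "Trying example:" entry. A standalone sketch of the same pattern (the max_examples value is illustrative, standing in for the suite's _MAX_SAMPLES):

from hypothesis import Verbosity, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    scale_ub=st.sampled_from([None, 1200.00]),
)
@settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
def test_grid(T: int, scale_ub) -> None:
    # Verbosity.verbose logs one "Trying example: ..." entry per draw,
    # which is what fills this log with repeated test bodies.
    assert T > 0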
2025-05-07T20:32:13.1288281Z W0507 20:32:13.125000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:13.1906662Z W0507 20:32:13.186000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:32:13.186000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] [both tracebacks identical to the previous warnings, each ending in:]
W0507 20:32:13.186000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.4617966Z 2025-05-07T20:32:13.4618199Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.4618683Z self=, 2025-05-07T20:32:13.4619275Z T=2048, 2025-05-07T20:32:13.4619485Z D=5120, 2025-05-07T20:32:13.4619690Z scale_ub=None, 2025-05-07T20:32:13.4619905Z contiguous=True, 2025-05-07T20:32:13.4620137Z compiled=True, 2025-05-07T20:32:13.4620358Z ) 2025-05-07T20:32:13.4620681Z self = 2025-05-07T20:32:13.4621500Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:13.4621777Z 2025-05-07T20:32:13.4621866Z @given( 2025-05-07T20:32:13.4622108Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.4622431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.4622744Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.4623079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.4623406Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.4623701Z ) 2025-05-07T20:32:13.4624054Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.4624497Z def test_silu_mul_quant( 2025-05-07T20:32:13.4624754Z self, 2025-05-07T20:32:13.4624956Z T: int, 2025-05-07T20:32:13.4625153Z D: int, 2025-05-07T20:32:13.4625377Z scale_ub: Optional[float], 2025-05-07T20:32:13.4625652Z contiguous: bool, 2025-05-07T20:32:13.4625896Z compiled: bool, 2025-05-07T20:32:13.4626130Z ) -> None: 2025-05-07T20:32:13.4626352Z torch.manual_seed(2025) 2025-05-07T20:32:13.4626590Z 2025-05-07T20:32:13.4626863Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.4627215Z 2025-05-07T20:32:13.4627410Z x_sign = torch.sign(x) 2025-05-07T20:32:13.4627705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.4628018Z x = x_sign * x_clamp 2025-05-07T20:32:13.4628264Z x0 = x[:, :D] 2025-05-07T20:32:13.4628480Z x1 = x[:, D:] 2025-05-07T20:32:13.4628695Z 2025-05-07T20:32:13.4628912Z if contiguous: 2025-05-07T20:32:13.4629165Z x0 = x0.contiguous() 2025-05-07T20:32:13.4629426Z x1 = x1.contiguous() 2025-05-07T20:32:13.4629671Z 2025-05-07T20:32:13.4629859Z if scale_ub is not None: 2025-05-07T20:32:13.4630134Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.4630477Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.4630779Z ) 2025-05-07T20:32:13.4630979Z else: 2025-05-07T20:32:13.4631195Z scale_ub_tensor = None 2025-05-07T20:32:13.4631441Z 2025-05-07T20:32:13.4631679Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.4631999Z op = silu_mul_quant 2025-05-07T20:32:13.4632246Z if compiled: 2025-05-07T20:32:13.4632496Z op = torch.compile(op) 2025-05-07T20:32:13.4632795Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.4633074Z 2025-05-07T20:32:13.4633261Z y_fp8, y_scale = fn() 2025-05-07T20:32:13.4633548Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:13.4633844Z 2025-05-07T20:32:13.4634077Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.4634415Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:13.4634711Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:13.4635026Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:13.4635398Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.4635710Z 2025-05-07T20:32:13.4635911Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:13.4636292Z 2025-05-07T20:32:13.4636393Z moe/activation_test.py:126: 2025-05-07T20:32:13.4636694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.4637040Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:13.4637362Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.4638151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:13.4638903Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:13.4639442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.4640204Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.4642245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:13.4642975Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:13.4643686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:13.4644324Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:13.4644921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:13.4645441Z fn() 2025-05-07T20:32:13.4645937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:13.4646517Z self.fn.run( 2025-05-07T20:32:13.4646991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.4647512Z kernel = self.compile( 2025-05-07T20:32:13.4648052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.4648714Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.4649116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.4649343Z 2025-05-07T20:32:13.4649551Z self = 2025-05-07T20:32:13.4650630Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.4652001Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5ae28cc0>} 2025-05-07T20:32:13.4653329Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.4654497Z context = 2025-05-07T20:32:13.4654787Z 2025-05-07T20:32:13.4654952Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.4655469Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.4655933Z module_map=module_map) 2025-05-07T20:32:13.4656297Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.4656654Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:13.4656926Z E ^ 2025-05-07T20:32:13.4657387Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.4657833Z 2025-05-07T20:32:13.4658244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.4658855Z 2025-05-07T20:32:13.4658958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.4659368Z self=, 2025-05-07T20:32:13.4659761Z T=128, 2025-05-07T20:32:13.4659951Z D=5120, 2025-05-07T20:32:13.4660143Z scale_ub=None, 2025-05-07T20:32:13.4660357Z contiguous=True, 2025-05-07T20:32:13.4660584Z compiled=True, 2025-05-07T20:32:13.4660789Z ) 2025-05-07T20:32:13.4661100Z self = 2025-05-07T20:32:13.4661588Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:13.4661856Z 2025-05-07T20:32:13.4661939Z @given( 2025-05-07T20:32:13.4662253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.4662563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.4662871Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.4663210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.4663533Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.4663821Z ) 2025-05-07T20:32:13.4664171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.4664617Z def test_silu_mul_quant( 2025-05-07T20:32:13.4664861Z self, 2025-05-07T20:32:13.4665060Z T: int, 2025-05-07T20:32:13.4665263Z D: int, 2025-05-07T20:32:13.4665478Z scale_ub: Optional[float], 2025-05-07T20:32:13.4665754Z contiguous: bool, 2025-05-07T20:32:13.4665995Z compiled: bool, 2025-05-07T20:32:13.4666216Z ) -> None: 2025-05-07T20:32:13.4666439Z torch.manual_seed(2025) 2025-05-07T20:32:13.4666714Z 2025-05-07T20:32:13.4666997Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.4667336Z 2025-05-07T20:32:13.4667534Z x_sign = torch.sign(x) 2025-05-07T20:32:13.4667826Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.4668136Z x = x_sign * x_clamp 2025-05-07T20:32:13.4668373Z x0 = x[:, :D] 2025-05-07T20:32:13.4668593Z x1 = x[:, D:] 2025-05-07T20:32:13.4668806Z 2025-05-07T20:32:13.4668994Z if contiguous: 2025-05-07T20:32:13.4669225Z x0 = x0.contiguous() 2025-05-07T20:32:13.4669480Z x1 = x1.contiguous() 2025-05-07T20:32:13.4669722Z 2025-05-07T20:32:13.4669919Z if scale_ub is not None: 2025-05-07T20:32:13.4670186Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.4670519Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.4670835Z ) 2025-05-07T20:32:13.4671031Z else: 2025-05-07T20:32:13.4671243Z scale_ub_tensor = None 2025-05-07T20:32:13.4671499Z 2025-05-07T20:32:13.4671732Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.4672041Z op = silu_mul_quant 2025-05-07T20:32:13.4672300Z if compiled: 2025-05-07T20:32:13.4672547Z op = torch.compile(op) 2025-05-07T20:32:13.4672839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.4673117Z 2025-05-07T20:32:13.4673311Z y_fp8, y_scale = fn() 2025-05-07T20:32:13.4673589Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:13.4673883Z 2025-05-07T20:32:13.4674125Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.4674454Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:13.4674743Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:13.4675058Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:13.4675424Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.4675726Z 2025-05-07T20:32:13.4675932Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:13.4676122Z 2025-05-07T20:32:13.4676230Z moe/activation_test.py:126: 2025-05-07T20:32:13.4676521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.4677010Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:13.4677331Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.4678109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:13.4678853Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:13.4679390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.4680067Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.4680831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:13.4681552Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:13.4682267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:13.4682901Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:13.4683503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:13.4684022Z fn() 2025-05-07T20:32:13.4684520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:13.4685096Z self.fn.run( 2025-05-07T20:32:13.4685561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.4686079Z kernel = self.compile( 2025-05-07T20:32:13.4686625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.4687274Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.4687675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.4687899Z 2025-05-07T20:32:13.4688104Z self = 2025-05-07T20:32:13.4689219Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.4690570Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5ae48f40>} 2025-05-07T20:32:13.4691900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.4692915Z context = 2025-05-07T20:32:13.4693203Z 2025-05-07T20:32:13.4693367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.4693970Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.4694435Z module_map=module_map) 2025-05-07T20:32:13.4694792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.4695149Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:13.4695419Z E ^ 2025-05-07T20:32:13.4695879Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.4696321Z 2025-05-07T20:32:13.4696736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.6940964Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:13.6942436Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:13.6943750Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:13.6945155Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:13.6946260Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:13.6947542Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:13.6948902Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.6950180Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:13.6951533Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.6952558Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] module_map=module_map) 2025-05-07T20:32:13.6953808Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:13.6955034Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:13.6955868Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:13.6957054Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:13.6958238Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:13.6959260Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:13.6960266Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:13.6961464Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:13.6962732Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:13.6963618Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:13.6964782Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:13.6965805Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:13.6966565Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:13.6967780Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:13.6969162Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:13.6970213Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.6971115Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.6971852Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:13.6972849Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.7567623Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:13.7569479Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:13.7570826Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:13.7572261Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:13.7573247Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:13.7574691Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:13.7576066Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.7577356Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:13.7578714Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.7579753Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] module_map=module_map) 2025-05-07T20:32:13.7580992Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:13.7582554Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:13.7583398Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:13.7584587Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:13.7585950Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:13.7586971Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:13.7587989Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return 
visitor(node) 2025-05-07T20:32:13.7589198Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:13.7590464Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:13.7591365Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:13.7592441Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:13.7593478Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:13.7594245Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:13.7595400Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:13.7596738Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:13.7597791Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.7599113Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.7599856Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:13.7600853Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2972082Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:14.2973263Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:14.2974703Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:14.2976591Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:14.2977556Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:14.2979007Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:14.2980378Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2981674Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:14.2983031Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2984058Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] module_map=module_map) 2025-05-07T20:32:14.2985308Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:14.2986540Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:14.2987382Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:14.2988569Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:14.2989754Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:14.2990785Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:14.2991796Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return 
visitor(node) 2025-05-07T20:32:14.2993011Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:14.2994275Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:14.2995164Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:14.2996246Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:14.2997279Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:14.2998123Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:14.2999667Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:14.3001008Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:14.3002058Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.3003083Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.3003826Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:14.3004838Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.3596960Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:14.3598464Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:14.3599867Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:14.3601288Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:14.3602271Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:14.3603552Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:14.3604920Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.3606210Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:14.3607578Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.3608613Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] module_map=module_map) 2025-05-07T20:32:14.3609850Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:14.3611094Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:14.3611932Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:14.3613442Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:14.3614755Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:14.3615772Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:14.3616935Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return 
visitor(node) 2025-05-07T20:32:14.3618137Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:14.3619404Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:14.3620293Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:14.3621363Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:14.3622398Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:14.3623160Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:14.3624316Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:14.3625658Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:14.3626710Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.3627613Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.3628350Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:14.3629363Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6673670Z 2025-05-07T20:32:14.6674344Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6675025Z self=, 2025-05-07T20:32:14.6675566Z T=4096, 2025-05-07T20:32:14.6675825Z D=5120, 2025-05-07T20:32:14.6676087Z scale_ub=None, 2025-05-07T20:32:14.6676378Z contiguous=True, 2025-05-07T20:32:14.6676605Z compiled=True, 2025-05-07T20:32:14.6676814Z ) 2025-05-07T20:32:14.6677132Z self = 2025-05-07T20:32:14.6677622Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6677912Z 2025-05-07T20:32:14.6678003Z @given( 2025-05-07T20:32:14.6678229Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6678545Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6678853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6679559Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6679878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6680165Z ) 2025-05-07T20:32:14.6680510Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6680946Z def test_silu_mul_quant( 2025-05-07T20:32:14.6681190Z self, 2025-05-07T20:32:14.6687692Z T: int, 2025-05-07T20:32:14.6687959Z D: int, 2025-05-07T20:32:14.6688180Z scale_ub: Optional[float], 2025-05-07T20:32:14.6688456Z contiguous: bool, 2025-05-07T20:32:14.6688687Z compiled: bool, 2025-05-07T20:32:14.6688920Z ) -> None: 2025-05-07T20:32:14.6689345Z torch.manual_seed(2025) 2025-05-07T20:32:14.6689595Z 2025-05-07T20:32:14.6689868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6690217Z 2025-05-07T20:32:14.6690402Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6690700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6691022Z x = x_sign * x_clamp 2025-05-07T20:32:14.6691254Z x0 = x[:, :D] 2025-05-07T20:32:14.6691471Z x1 = x[:, D:] 2025-05-07T20:32:14.6691679Z 2025-05-07T20:32:14.6691862Z if contiguous: 2025-05-07T20:32:14.6692084Z x0 = x0.contiguous() 2025-05-07T20:32:14.6692341Z x1 = x1.contiguous() 2025-05-07T20:32:14.6692583Z 2025-05-07T20:32:14.6692770Z if scale_ub is not None: 2025-05-07T20:32:14.6693042Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6693381Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6693835Z ) 2025-05-07T20:32:14.6694042Z else: 2025-05-07T20:32:14.6694269Z scale_ub_tensor = None 2025-05-07T20:32:14.6694526Z 2025-05-07T20:32:14.6694770Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6695095Z op = silu_mul_quant 2025-05-07T20:32:14.6695336Z if compiled: 2025-05-07T20:32:14.6695591Z op = torch.compile(op) 2025-05-07T20:32:14.6695886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6696155Z 2025-05-07T20:32:14.6696352Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6696635Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6696929Z 2025-05-07T20:32:14.6697158Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6697491Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6697784Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6698088Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6698807Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6699118Z 2025-05-07T20:32:14.6699311Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6699511Z 2025-05-07T20:32:14.6699612Z moe/activation_test.py:126: 2025-05-07T20:32:14.6699908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6700236Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6700558Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6702834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6703573Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6704110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6704789Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6705462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6706174Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6707075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6707701Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6708287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6708801Z fn() 2025-05-07T20:32:14.6709299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6709866Z self.fn.run( 2025-05-07T20:32:14.6710447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6710969Z kernel = self.compile( 2025-05-07T20:32:14.6711500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6712133Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6712532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6712754Z 2025-05-07T20:32:14.6712964Z self = 2025-05-07T20:32:14.6714027Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6715383Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5ae4b1a0>} 2025-05-07T20:32:14.6716708Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6717716Z context = 2025-05-07T20:32:14.6717996Z 2025-05-07T20:32:14.6718163Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6718668Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6719132Z module_map=module_map) 2025-05-07T20:32:14.6719491Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6719840Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6720100Z E ^ 2025-05-07T20:32:14.6720556Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6721000Z 2025-05-07T20:32:14.6721413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6721912Z 2025-05-07T20:32:14.6722014Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6722421Z self=, 2025-05-07T20:32:14.6722816Z T=16384, 2025-05-07T20:32:14.6723009Z D=5120, 2025-05-07T20:32:14.6723196Z scale_ub=None, 2025-05-07T20:32:14.6723409Z contiguous=True, 2025-05-07T20:32:14.6723630Z compiled=True, 2025-05-07T20:32:14.6723828Z ) 2025-05-07T20:32:14.6724142Z self = 2025-05-07T20:32:14.6724627Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6724895Z 2025-05-07T20:32:14.6724972Z @given( 2025-05-07T20:32:14.6725202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6725514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6725813Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6726139Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6726555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6726840Z ) 2025-05-07T20:32:14.6727179Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6727618Z def test_silu_mul_quant( 2025-05-07T20:32:14.6727858Z self, 2025-05-07T20:32:14.6728046Z T: int, 2025-05-07T20:32:14.6728244Z D: int, 2025-05-07T20:32:14.6728460Z scale_ub: Optional[float], 2025-05-07T20:32:14.6728724Z contiguous: bool, 2025-05-07T20:32:14.6728962Z compiled: bool, 2025-05-07T20:32:14.6729209Z ) -> None: 2025-05-07T20:32:14.6729442Z torch.manual_seed(2025) 2025-05-07T20:32:14.6729688Z 2025-05-07T20:32:14.6730044Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6730382Z 2025-05-07T20:32:14.6730578Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6730861Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6731161Z x = x_sign * x_clamp 2025-05-07T20:32:14.6731395Z x0 = x[:, :D] 2025-05-07T20:32:14.6731606Z x1 = x[:, D:] 2025-05-07T20:32:14.6731807Z 2025-05-07T20:32:14.6731979Z if contiguous: 2025-05-07T20:32:14.6732206Z x0 = x0.contiguous() 2025-05-07T20:32:14.6732455Z x1 = x1.contiguous() 2025-05-07T20:32:14.6732682Z 2025-05-07T20:32:14.6732869Z if scale_ub is not None: 2025-05-07T20:32:14.6733139Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6733459Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6733882Z ) 2025-05-07T20:32:14.6734070Z else: 2025-05-07T20:32:14.6734268Z scale_ub_tensor = None 2025-05-07T20:32:14.6734518Z 2025-05-07T20:32:14.6734749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6735051Z op = silu_mul_quant 2025-05-07T20:32:14.6735297Z if compiled: 2025-05-07T20:32:14.6735537Z op = torch.compile(op) 2025-05-07T20:32:14.6735827Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6736097Z 2025-05-07T20:32:14.6736284Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6736560Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6736837Z 2025-05-07T20:32:14.6737093Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6737418Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6737698Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6738003Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6738353Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6738660Z 2025-05-07T20:32:14.6738856Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6739052Z 2025-05-07T20:32:14.6739148Z moe/activation_test.py:126: 2025-05-07T20:32:14.6739440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6739768Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6740090Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6740862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6741594Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6742127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6742794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6743478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6744181Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6744902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6745623Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6746214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6746720Z fn() 2025-05-07T20:32:14.6747222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6747796Z self.fn.run( 2025-05-07T20:32:14.6748254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6748766Z kernel = self.compile( 2025-05-07T20:32:14.6749405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6750048Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6750433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6750669Z 2025-05-07T20:32:14.6750872Z self = 2025-05-07T20:32:14.6751940Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6753287Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5bce8e0>} 2025-05-07T20:32:14.6754611Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6755756Z context = 2025-05-07T20:32:14.6756045Z 2025-05-07T20:32:14.6756214Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6756726Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6757187Z module_map=module_map) 2025-05-07T20:32:14.6757542Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6757895Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6758157Z E ^ 2025-05-07T20:32:14.6758604Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6759048Z 2025-05-07T20:32:14.6759458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6959153Z W0507 20:32:14.694000 276022 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:14.6960440Z W0507 20:32:14.694000 276022 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:14.6961766Z W0507 20:32:14.694000 276022 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:14.6962732Z W0507 20:32:14.694000 276022 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:14.6963826Z W0507 20:32:14.694000 276022 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:15.1199085Z 2025-05-07T20:32:15.1199703Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.1200319Z self=, 2025-05-07T20:32:15.1201294Z T=1, 2025-05-07T20:32:15.1201493Z D=5120, 2025-05-07T20:32:15.1201687Z scale_ub=1200.0, 2025-05-07T20:32:15.1201914Z contiguous=True, 2025-05-07T20:32:15.1202136Z compiled=True, 2025-05-07T20:32:15.1202341Z ) 2025-05-07T20:32:15.1202664Z self = 2025-05-07T20:32:15.1203155Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:15.1203410Z 2025-05-07T20:32:15.1203488Z @given( 2025-05-07T20:32:15.1203720Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.1204036Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.1204336Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.1204829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.1205162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.1205452Z ) 2025-05-07T20:32:15.1205792Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.1206238Z def test_silu_mul_quant( 2025-05-07T20:32:15.1206485Z self, 2025-05-07T20:32:15.1206677Z T: int, 2025-05-07T20:32:15.1206873Z D: int, 2025-05-07T20:32:15.1207088Z scale_ub: Optional[float], 2025-05-07T20:32:15.1207352Z contiguous: bool, 2025-05-07T20:32:15.1207591Z compiled: bool, 2025-05-07T20:32:15.1207818Z ) -> None: 2025-05-07T20:32:15.1208026Z torch.manual_seed(2025) 2025-05-07T20:32:15.1208265Z 2025-05-07T20:32:15.1208538Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.1208870Z 2025-05-07T20:32:15.1209065Z x_sign = torch.sign(x) 2025-05-07T20:32:15.1209402Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.1209720Z x = x_sign * x_clamp 2025-05-07T20:32:15.1209956Z x0 = x[:, :D] 2025-05-07T20:32:15.1210204Z x1 = x[:, D:] 2025-05-07T20:32:15.1210416Z 2025-05-07T20:32:15.1210613Z if contiguous: 2025-05-07T20:32:15.1210843Z x0 = x0.contiguous() 2025-05-07T20:32:15.1211105Z x1 = x1.contiguous() 2025-05-07T20:32:15.1211347Z 2025-05-07T20:32:15.1211540Z if scale_ub is not None: 2025-05-07T20:32:15.1211815Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.1212147Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:15.1212461Z ) 2025-05-07T20:32:15.1212653Z else: 2025-05-07T20:32:15.1212867Z scale_ub_tensor = None 2025-05-07T20:32:15.1213122Z 2025-05-07T20:32:15.1213349Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.1213820Z op = silu_mul_quant 2025-05-07T20:32:15.1214077Z if compiled: 2025-05-07T20:32:15.1214325Z op = torch.compile(op) 2025-05-07T20:32:15.1214623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.1214898Z 2025-05-07T20:32:15.1215094Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.1215265Z 2025-05-07T20:32:15.1215366Z moe/activation_test.py:117: 2025-05-07T20:32:15.1215661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.1215994Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.1216269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.1216827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.1217394Z return fn(*args, **kwargs) 2025-05-07T20:32:15.1218037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.1218722Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.1219255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.1219929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.1220668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.1221201Z kernel = self.compile( 2025-05-07T20:32:15.1221739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.1222381Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.1222777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.1223011Z 2025-05-07T20:32:15.1223216Z self = 2025-05-07T20:32:15.1224351Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.1225715Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a9f4400>} 2025-05-07T20:32:15.1227038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.1228046Z context = 2025-05-07T20:32:15.1228330Z 2025-05-07T20:32:15.1228499Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.1229012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.1229475Z module_map=module_map) 2025-05-07T20:32:15.1229839Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.1230192Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.1230445Z E ^ 2025-05-07T20:32:15.1230907Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.1231347Z 2025-05-07T20:32:15.1231759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.1232261Z 2025-05-07T20:32:15.1232370Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.1232771Z self=, 2025-05-07T20:32:15.1233169Z T=1, 2025-05-07T20:32:15.1233351Z D=5120, 2025-05-07T20:32:15.1233540Z scale_ub=None, 2025-05-07T20:32:15.1233758Z contiguous=False, 2025-05-07T20:32:15.1233983Z compiled=True, 2025-05-07T20:32:15.1234182Z ) 2025-05-07T20:32:15.1234502Z self = 2025-05-07T20:32:15.1234979Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:15.1235232Z 2025-05-07T20:32:15.1235319Z @given( 2025-05-07T20:32:15.1235544Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.1235864Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.1236168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.1236491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.1236819Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.1237107Z ) 2025-05-07T20:32:15.1237449Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.1237889Z def test_silu_mul_quant( 2025-05-07T20:32:15.1238132Z self, 2025-05-07T20:32:15.1238323Z T: int, 2025-05-07T20:32:15.1238524Z D: int, 2025-05-07T20:32:15.1238750Z scale_ub: Optional[float], 2025-05-07T20:32:15.1239023Z contiguous: bool, 2025-05-07T20:32:15.1239259Z compiled: bool, 2025-05-07T20:32:15.1239483Z ) -> None: 2025-05-07T20:32:15.1239701Z torch.manual_seed(2025) 2025-05-07T20:32:15.1240034Z 2025-05-07T20:32:15.1240306Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.1240649Z 2025-05-07T20:32:15.1240838Z x_sign = torch.sign(x) 2025-05-07T20:32:15.1241127Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.1241438Z x = x_sign * x_clamp 2025-05-07T20:32:15.1241676Z x0 = x[:, :D] 2025-05-07T20:32:15.1241892Z x1 = x[:, D:] 2025-05-07T20:32:15.1242101Z 2025-05-07T20:32:15.1242282Z if contiguous: 2025-05-07T20:32:15.1242517Z x0 = x0.contiguous() 2025-05-07T20:32:15.1242773Z x1 = x1.contiguous() 2025-05-07T20:32:15.1243006Z 2025-05-07T20:32:15.1243280Z if scale_ub is not None: 2025-05-07T20:32:15.1243560Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.1243887Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.1244196Z ) 2025-05-07T20:32:15.1244394Z else: 2025-05-07T20:32:15.1244609Z scale_ub_tensor = None 2025-05-07T20:32:15.1244857Z 2025-05-07T20:32:15.1245088Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.1245402Z op = silu_mul_quant 2025-05-07T20:32:15.1245645Z if compiled: 2025-05-07T20:32:15.1245890Z op = torch.compile(op) 2025-05-07T20:32:15.1246185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.1246456Z 2025-05-07T20:32:15.1246651Z y_fp8, y_scale = fn() 2025-05-07T20:32:15.1246933Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:15.1247221Z 2025-05-07T20:32:15.1247459Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.1247803Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:15.1248091Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:15.1248402Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:15.1248765Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.1249078Z 2025-05-07T20:32:15.1249277Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:15.1249476Z 2025-05-07T20:32:15.1249575Z moe/activation_test.py:126: 2025-05-07T20:32:15.1249871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.1250201Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:15.1250527Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.1251301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:15.1252044Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:15.1252585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.1253257Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.1254039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:15.1254747Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.1255465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:15.1256097Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:15.1256697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:15.1257209Z fn() 2025-05-07T20:32:15.1257720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:15.1258294Z self.fn.run( 2025-05-07T20:32:15.1258755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.1259402Z kernel = self.compile( 2025-05-07T20:32:15.1259942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.1260587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.1260976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.1261208Z 2025-05-07T20:32:15.1261414Z self = 2025-05-07T20:32:15.1262550Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.1263899Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a9ee020>} 2025-05-07T20:32:15.1265226Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.1266227Z context = 2025-05-07T20:32:15.1266521Z 2025-05-07T20:32:15.1266684Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.1267198Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.1267654Z module_map=module_map) 2025-05-07T20:32:15.1268016Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.1268373Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:15.1268638Z E ^ 2025-05-07T20:32:15.1269090Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.1269591Z 2025-05-07T20:32:15.1269998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.2704010Z 2025-05-07T20:32:15.2704425Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.2705059Z self=, 2025-05-07T20:32:15.2705587Z T=1, 2025-05-07T20:32:15.2705783Z D=5120, 2025-05-07T20:32:15.2705988Z scale_ub=None, 2025-05-07T20:32:15.2706208Z contiguous=True, 2025-05-07T20:32:15.2706435Z compiled=False, 2025-05-07T20:32:15.2706656Z ) 2025-05-07T20:32:15.2706984Z self = 2025-05-07T20:32:15.2707699Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:15.2714728Z 2025-05-07T20:32:15.2714826Z @given( 2025-05-07T20:32:15.2715080Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.2715403Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.2715720Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.2716052Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.2716382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.2716664Z ) 2025-05-07T20:32:15.2717016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.2717463Z def test_silu_mul_quant( 2025-05-07T20:32:15.2717701Z self, 2025-05-07T20:32:15.2717901Z T: int, 2025-05-07T20:32:15.2718098Z D: int, 2025-05-07T20:32:15.2718309Z scale_ub: Optional[float], 2025-05-07T20:32:15.2718579Z contiguous: bool, 2025-05-07T20:32:15.2718818Z compiled: bool, 2025-05-07T20:32:15.2719084Z ) -> None: 2025-05-07T20:32:15.2719308Z torch.manual_seed(2025) 2025-05-07T20:32:15.2719581Z 2025-05-07T20:32:15.2719858Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.2720570Z 2025-05-07T20:32:15.2720755Z x_sign = torch.sign(x) 2025-05-07T20:32:15.2721042Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.2721353Z x = x_sign * x_clamp 2025-05-07T20:32:15.2721589Z x0 = x[:, :D] 2025-05-07T20:32:15.2721796Z x1 = x[:, D:] 2025-05-07T20:32:15.2722006Z 2025-05-07T20:32:15.2722186Z if contiguous: 2025-05-07T20:32:15.2722407Z x0 = x0.contiguous() 2025-05-07T20:32:15.2722662Z x1 = x1.contiguous() 2025-05-07T20:32:15.2722902Z 2025-05-07T20:32:15.2723090Z if scale_ub is not None: 2025-05-07T20:32:15.2723362Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.2723845Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.2724154Z ) 2025-05-07T20:32:15.2724350Z else: 2025-05-07T20:32:15.2724561Z scale_ub_tensor = None 2025-05-07T20:32:15.2724802Z 2025-05-07T20:32:15.2725031Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.2725353Z op = silu_mul_quant 2025-05-07T20:32:15.2725594Z if compiled: 2025-05-07T20:32:15.2725840Z op = torch.compile(op) 2025-05-07T20:32:15.2726131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2726402Z 2025-05-07T20:32:15.2726582Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.2726747Z 2025-05-07T20:32:15.2726839Z moe/activation_test.py:117: 2025-05-07T20:32:15.2727134Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2727461Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.2727740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2728423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.2729103Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.2729634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.2730309Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.2730962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.2731489Z kernel = self.compile( 2025-05-07T20:32:15.2732023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.2732665Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.2733061Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2733287Z 2025-05-07T20:32:15.2733503Z self = 2025-05-07T20:32:15.2734723Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.2736076Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60114400>} 2025-05-07T20:32:15.2737393Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.2738398Z context = 2025-05-07T20:32:15.2738676Z 2025-05-07T20:32:15.2738851Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.2739352Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.2739806Z module_map=module_map) 2025-05-07T20:32:15.2740290Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.2740638Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.2740889Z E ^ 2025-05-07T20:32:15.2741344Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.2742188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:15.2742689Z
Hypothesis went on to try the remaining sampled examples. Every one failed with the identical CompilationError raised from triton/compiler/compiler.py:100: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The kernel that failed to compile was _fbgemm_silu_mul_quant (reached from silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, eagerly and under torch.compile alike), except where noted below, where fn() ran and the reference path failed instead while compiling _kernel_quantize_fp8_row (reached from triton_quantize_fp8_row at fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370):
2025-05-07T20:32:15.2742795Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.2774528Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.4376161Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.4407301Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.5994027Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.6026902Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.8125071Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fn() succeeded; _kernel_quantize_fp8_row failed in ref_fn()
2025-05-07T20:32:15.8167856Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.9610355Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.9642759Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.9675326Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:16.1569366Z > y_fp8, y_scale = fn()
2025-05-07T20:32:16.1569630Z moe/activation_test.py:117:
2025-05-07T20:32:16.1570251Z moe/activation_test.py:115: in fn
2025-05-07T20:32:16.1570535Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:16.1571088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:16.1571644Z     return fn(*args, **kwargs)
2025-05-07T20:32:16.1572291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.1572968Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.1573500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.1574317Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.1574970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.1575499Z kernel = self.compile( 2025-05-07T20:32:16.1576040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.1576686Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.1577082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.1577309Z 2025-05-07T20:32:16.1577520Z self = 2025-05-07T20:32:16.1578582Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.1580008Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5de58a0>} 2025-05-07T20:32:16.1581349Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.1582360Z context = 2025-05-07T20:32:16.1582690Z 2025-05-07T20:32:16.1582859Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.1583365Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.1583827Z module_map=module_map) 2025-05-07T20:32:16.1584189Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.1584543Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.1584797Z E ^ 2025-05-07T20:32:16.1585254Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.1585695Z 2025-05-07T20:32:16.1586192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.1586702Z 2025-05-07T20:32:16.1586804Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.1587220Z self=, 2025-05-07T20:32:16.1587618Z T=1, 2025-05-07T20:32:16.1587801Z D=5120, 2025-05-07T20:32:16.1587993Z scale_ub=None, 2025-05-07T20:32:16.1588208Z contiguous=False, 2025-05-07T20:32:16.1588468Z compiled=False, 2025-05-07T20:32:16.1588674Z ) 2025-05-07T20:32:16.1588985Z self = 2025-05-07T20:32:16.1589472Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.1589737Z 2025-05-07T20:32:16.1589818Z @given( 2025-05-07T20:32:16.1590052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.1590361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.1590671Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.1591001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.1591318Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.1591605Z ) 2025-05-07T20:32:16.1591952Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.1592389Z def test_silu_mul_quant( 2025-05-07T20:32:16.1592627Z self, 2025-05-07T20:32:16.1592821Z T: int, 2025-05-07T20:32:16.1593014Z D: int, 2025-05-07T20:32:16.1593225Z scale_ub: Optional[float], 2025-05-07T20:32:16.1593493Z contiguous: bool, 2025-05-07T20:32:16.1593736Z compiled: bool, 2025-05-07T20:32:16.1593949Z ) -> None: 2025-05-07T20:32:16.1594166Z torch.manual_seed(2025) 2025-05-07T20:32:16.1594406Z 2025-05-07T20:32:16.1594669Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.1595014Z 2025-05-07T20:32:16.1595219Z x_sign = torch.sign(x) 2025-05-07T20:32:16.1595504Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.1595811Z x = x_sign * x_clamp 2025-05-07T20:32:16.1596051Z x0 = x[:, :D] 2025-05-07T20:32:16.1596262Z x1 = x[:, D:] 2025-05-07T20:32:16.1596468Z 2025-05-07T20:32:16.1596653Z if contiguous: 2025-05-07T20:32:16.1596875Z x0 = x0.contiguous() 2025-05-07T20:32:16.1597134Z x1 = x1.contiguous() 2025-05-07T20:32:16.1597378Z 2025-05-07T20:32:16.1597567Z if scale_ub is not None: 2025-05-07T20:32:16.1597838Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.1598418Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.1598880Z ) 2025-05-07T20:32:16.1599071Z else: 2025-05-07T20:32:16.1599285Z scale_ub_tensor = None 2025-05-07T20:32:16.1599572Z 2025-05-07T20:32:16.1599822Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.1600143Z op = silu_mul_quant 2025-05-07T20:32:16.1600394Z if compiled: 2025-05-07T20:32:16.1600640Z op = torch.compile(op) 2025-05-07T20:32:16.1600941Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.1601298Z 2025-05-07T20:32:16.1601485Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.1601655Z 2025-05-07T20:32:16.1601752Z moe/activation_test.py:117: 2025-05-07T20:32:16.1602051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.1602382Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.1602663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.1603346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.1604027Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.1604676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.1605356Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.1606018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.1606552Z kernel = self.compile( 2025-05-07T20:32:16.1607083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.1607734Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.1608128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.1608354Z 2025-05-07T20:32:16.1608559Z self = 2025-05-07T20:32:16.1609675Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.1611035Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5de53a0>} 2025-05-07T20:32:16.1612366Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.1613379Z context = 2025-05-07T20:32:16.1613726Z 2025-05-07T20:32:16.1613890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.1614409Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.1614874Z module_map=module_map) 2025-05-07T20:32:16.1615244Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.1615594Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.1615855Z E ^ 2025-05-07T20:32:16.1616314Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.1616757Z 2025-05-07T20:32:16.1617166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.1617680Z 2025-05-07T20:32:16.1617781Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.1618193Z self=, 2025-05-07T20:32:16.1618593Z T=4096, 2025-05-07T20:32:16.1618837Z D=7168, 2025-05-07T20:32:16.1619034Z scale_ub=1200.0, 2025-05-07T20:32:16.1619256Z contiguous=False, 2025-05-07T20:32:16.1619472Z compiled=False, 2025-05-07T20:32:16.1619679Z ) 2025-05-07T20:32:16.1619996Z self = 2025-05-07T20:32:16.1620485Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.1620762Z 2025-05-07T20:32:16.1620838Z @given( 2025-05-07T20:32:16.1630273Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.1630702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.1631013Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.1631347Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.1631678Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.1631957Z ) 2025-05-07T20:32:16.1632309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.1632754Z def test_silu_mul_quant( 2025-05-07T20:32:16.1632986Z self, 2025-05-07T20:32:16.1633178Z T: int, 2025-05-07T20:32:16.1633374Z D: int, 2025-05-07T20:32:16.1633582Z scale_ub: Optional[float], 2025-05-07T20:32:16.1633857Z contiguous: bool, 2025-05-07T20:32:16.1634185Z compiled: bool, 2025-05-07T20:32:16.1634400Z ) -> None: 2025-05-07T20:32:16.1634600Z torch.manual_seed(2025) 2025-05-07T20:32:16.1634841Z 2025-05-07T20:32:16.1635104Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.1635455Z 2025-05-07T20:32:16.1635648Z x_sign = torch.sign(x) 2025-05-07T20:32:16.1635933Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.1636243Z x = x_sign * x_clamp 2025-05-07T20:32:16.1636481Z x0 = x[:, :D] 2025-05-07T20:32:16.1636696Z x1 = x[:, D:] 2025-05-07T20:32:16.1636895Z 2025-05-07T20:32:16.1637084Z if contiguous: 2025-05-07T20:32:16.1637309Z x0 = x0.contiguous() 2025-05-07T20:32:16.1637553Z x1 = x1.contiguous() 2025-05-07T20:32:16.1637785Z 2025-05-07T20:32:16.1637977Z if scale_ub is not None: 2025-05-07T20:32:16.1638239Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.1638575Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.1638881Z ) 2025-05-07T20:32:16.1639066Z else: 2025-05-07T20:32:16.1639272Z scale_ub_tensor = None 2025-05-07T20:32:16.1639529Z 2025-05-07T20:32:16.1639755Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.1640067Z op = silu_mul_quant 2025-05-07T20:32:16.1640307Z if compiled: 2025-05-07T20:32:16.1640543Z op = torch.compile(op) 2025-05-07T20:32:16.1640826Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.1641100Z 2025-05-07T20:32:16.1641293Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.1641455Z 2025-05-07T20:32:16.1641553Z moe/activation_test.py:117: 2025-05-07T20:32:16.1641846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.1642171Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.1642456Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.1643150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:16.1643829Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.1644362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.1645039Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.1645686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.1646214Z kernel = self.compile( 2025-05-07T20:32:16.1646813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.1647471Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.1647868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.1648098Z 2025-05-07T20:32:16.1648307Z self = 2025-05-07T20:32:16.1649374Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.1650774Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5de63e0>} 2025-05-07T20:32:16.1652095Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.1653118Z context = 2025-05-07T20:32:16.1653410Z 2025-05-07T20:32:16.1653719Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.1654237Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.1654689Z module_map=module_map) 2025-05-07T20:32:16.1655057Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.1655408Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.1655665Z E ^ 2025-05-07T20:32:16.1656114Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.1656561Z 2025-05-07T20:32:16.1656970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3206629Z 2025-05-07T20:32:16.3207036Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3207615Z self=, 2025-05-07T20:32:16.3208052Z T=16384, 2025-05-07T20:32:16.3208289Z D=7168, 2025-05-07T20:32:16.3208480Z scale_ub=None, 2025-05-07T20:32:16.3208705Z contiguous=True, 2025-05-07T20:32:16.3208931Z compiled=True, 2025-05-07T20:32:16.3209143Z ) 2025-05-07T20:32:16.3209468Z self = 2025-05-07T20:32:16.3210012Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.3210279Z 2025-05-07T20:32:16.3210358Z @given( 2025-05-07T20:32:16.3210594Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3210908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3211209Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3211548Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3211879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3212168Z ) 2025-05-07T20:32:16.3212514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3212954Z def test_silu_mul_quant( 2025-05-07T20:32:16.3213199Z self, 2025-05-07T20:32:16.3213391Z T: int, 2025-05-07T20:32:16.3213594Z D: int, 2025-05-07T20:32:16.3214021Z scale_ub: Optional[float], 2025-05-07T20:32:16.3214291Z contiguous: bool, 2025-05-07T20:32:16.3214533Z compiled: bool, 2025-05-07T20:32:16.3214763Z ) -> None: 2025-05-07T20:32:16.3214975Z torch.manual_seed(2025) 2025-05-07T20:32:16.3215216Z 2025-05-07T20:32:16.3215488Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3215824Z 2025-05-07T20:32:16.3216030Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3216622Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3216928Z x = x_sign * x_clamp 2025-05-07T20:32:16.3217173Z x0 = x[:, :D] 2025-05-07T20:32:16.3217392Z x1 = x[:, D:] 2025-05-07T20:32:16.3217593Z 2025-05-07T20:32:16.3217795Z if contiguous: 2025-05-07T20:32:16.3218031Z x0 = x0.contiguous() 2025-05-07T20:32:16.3218286Z x1 = x1.contiguous() 2025-05-07T20:32:16.3218523Z 2025-05-07T20:32:16.3218715Z if scale_ub is not None: 2025-05-07T20:32:16.3219089Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3219415Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3219730Z ) 2025-05-07T20:32:16.3219925Z else: 2025-05-07T20:32:16.3220134Z scale_ub_tensor = None 2025-05-07T20:32:16.3220385Z 2025-05-07T20:32:16.3220619Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3220931Z op = silu_mul_quant 2025-05-07T20:32:16.3221183Z if compiled: 2025-05-07T20:32:16.3221435Z op = torch.compile(op) 2025-05-07T20:32:16.3221727Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3222002Z 2025-05-07T20:32:16.3222357Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3222525Z 2025-05-07T20:32:16.3222630Z moe/activation_test.py:117: 2025-05-07T20:32:16.3222918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3223250Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3223532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3224088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3224648Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3225304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3225986Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3226514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3227195Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3227860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3228385Z kernel = self.compile( 2025-05-07T20:32:16.3228926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3229603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3230030Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3230256Z 2025-05-07T20:32:16.3230464Z self = 2025-05-07T20:32:16.3231537Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3232913Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4ef2a20>} 2025-05-07T20:32:16.3234246Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3235255Z context = 2025-05-07T20:32:16.3235548Z 2025-05-07T20:32:16.3235712Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3236232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3236779Z module_map=module_map) 2025-05-07T20:32:16.3237138Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3237490Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3237756Z E ^ 2025-05-07T20:32:16.3238215Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3238664Z 2025-05-07T20:32:16.3239075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3239675Z 2025-05-07T20:32:16.3239776Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3240186Z self=, 2025-05-07T20:32:16.3240576Z T=4096, 2025-05-07T20:32:16.3240765Z D=5120, 2025-05-07T20:32:16.3240957Z scale_ub=None, 2025-05-07T20:32:16.3241168Z contiguous=False, 2025-05-07T20:32:16.3241395Z compiled=True, 2025-05-07T20:32:16.3241596Z ) 2025-05-07T20:32:16.3241903Z self = 2025-05-07T20:32:16.3242388Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3242660Z 2025-05-07T20:32:16.3242820Z @given( 2025-05-07T20:32:16.3243051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3243354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3243658Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3243987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3244305Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3244585Z ) 2025-05-07T20:32:16.3244930Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3245365Z def test_silu_mul_quant( 2025-05-07T20:32:16.3245599Z self, 2025-05-07T20:32:16.3245797Z T: int, 2025-05-07T20:32:16.3245990Z D: int, 2025-05-07T20:32:16.3246201Z scale_ub: Optional[float], 2025-05-07T20:32:16.3246473Z contiguous: bool, 2025-05-07T20:32:16.3246710Z compiled: bool, 2025-05-07T20:32:16.3246925Z ) -> None: 2025-05-07T20:32:16.3247146Z torch.manual_seed(2025) 2025-05-07T20:32:16.3247382Z 2025-05-07T20:32:16.3247645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3247989Z 2025-05-07T20:32:16.3248181Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3248467Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3248779Z x = x_sign * x_clamp 2025-05-07T20:32:16.3249019Z x0 = x[:, :D] 2025-05-07T20:32:16.3249229Z x1 = x[:, D:] 2025-05-07T20:32:16.3249455Z 2025-05-07T20:32:16.3249653Z if contiguous: 2025-05-07T20:32:16.3249917Z x0 = x0.contiguous() 2025-05-07T20:32:16.3250175Z x1 = x1.contiguous() 2025-05-07T20:32:16.3250412Z 2025-05-07T20:32:16.3250609Z if scale_ub is not None: 2025-05-07T20:32:16.3250884Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3251209Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3251520Z ) 2025-05-07T20:32:16.3251720Z else: 2025-05-07T20:32:16.3251923Z scale_ub_tensor = None 2025-05-07T20:32:16.3252179Z 2025-05-07T20:32:16.3252411Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3252725Z op = silu_mul_quant 2025-05-07T20:32:16.3252968Z if compiled: 2025-05-07T20:32:16.3253216Z op = torch.compile(op) 2025-05-07T20:32:16.3253513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3253858Z 2025-05-07T20:32:16.3254051Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3254214Z 2025-05-07T20:32:16.3254317Z moe/activation_test.py:117: 2025-05-07T20:32:16.3254604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3254992Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3255273Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3255823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3256380Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3257032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3257759Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3258289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3258962Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3259622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3260155Z kernel = self.compile( 2025-05-07T20:32:16.3260686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3261335Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3261813Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3262039Z 2025-05-07T20:32:16.3262246Z self = 2025-05-07T20:32:16.3263314Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3264674Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4ef3c40>} 2025-05-07T20:32:16.3266010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3267026Z context = 2025-05-07T20:32:16.3267311Z 2025-05-07T20:32:16.3267475Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3267987Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3268453Z module_map=module_map) 2025-05-07T20:32:16.3268818Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3269166Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3269448Z E ^ 2025-05-07T20:32:16.3269940Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3270386Z 2025-05-07T20:32:16.3270795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.4653301Z 2025-05-07T20:32:16.4653522Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.4654070Z self=, 2025-05-07T20:32:16.4654622Z T=4096, 2025-05-07T20:32:16.4654881Z D=5120, 2025-05-07T20:32:16.4655084Z scale_ub=1200.0, 2025-05-07T20:32:16.4655308Z contiguous=False, 2025-05-07T20:32:16.4655544Z compiled=False, 2025-05-07T20:32:16.4655760Z ) 2025-05-07T20:32:16.4656074Z self = 2025-05-07T20:32:16.4656573Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.4656852Z 2025-05-07T20:32:16.4656930Z @given( 2025-05-07T20:32:16.4657162Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.4657473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.4657996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.4658324Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.4658646Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.4658930Z ) 2025-05-07T20:32:16.4659286Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.4659773Z def test_silu_mul_quant( 2025-05-07T20:32:16.4660015Z self, 2025-05-07T20:32:16.4660312Z T: int, 2025-05-07T20:32:16.4660506Z D: int, 2025-05-07T20:32:16.4660729Z scale_ub: Optional[float], 2025-05-07T20:32:16.4661000Z contiguous: bool, 2025-05-07T20:32:16.4661245Z compiled: bool, 2025-05-07T20:32:16.4661470Z ) -> None: 2025-05-07T20:32:16.4661688Z torch.manual_seed(2025) 2025-05-07T20:32:16.4661932Z 2025-05-07T20:32:16.4662202Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.4662546Z 2025-05-07T20:32:16.4662740Z x_sign = torch.sign(x) 2025-05-07T20:32:16.4663032Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.4663342Z x = x_sign * x_clamp 2025-05-07T20:32:16.4663589Z x0 = x[:, :D] 2025-05-07T20:32:16.4663952Z x1 = x[:, D:] 2025-05-07T20:32:16.4664164Z 2025-05-07T20:32:16.4664353Z if contiguous: 2025-05-07T20:32:16.4664581Z x0 = x0.contiguous() 2025-05-07T20:32:16.4664847Z x1 = x1.contiguous() 2025-05-07T20:32:16.4665090Z 2025-05-07T20:32:16.4665278Z if scale_ub is not None: 2025-05-07T20:32:16.4665554Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.4665892Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.4666200Z ) 2025-05-07T20:32:16.4666399Z else: 2025-05-07T20:32:16.4666613Z scale_ub_tensor = None 2025-05-07T20:32:16.4666870Z 2025-05-07T20:32:16.4667103Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.4667420Z op = silu_mul_quant 2025-05-07T20:32:16.4667672Z if compiled: 2025-05-07T20:32:16.4667917Z op = torch.compile(op) 2025-05-07T20:32:16.4668214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.4668502Z 2025-05-07T20:32:16.4668692Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.4668861Z 2025-05-07T20:32:16.4668961Z moe/activation_test.py:117: 2025-05-07T20:32:16.4669261Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.4669593Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.4669920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.4670618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:16.4671306Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.4671834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.4672515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.4673177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.4673711Z kernel = self.compile( 2025-05-07T20:32:16.4674246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.4674906Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.4675307Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.4675537Z 2025-05-07T20:32:16.4675744Z self = 2025-05-07T20:32:16.4676813Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.4678239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda52582c0>} 2025-05-07T20:32:16.4679564Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.4680620Z context = 2025-05-07T20:32:16.4680902Z 2025-05-07T20:32:16.4681068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.4681585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.4682052Z module_map=module_map) 2025-05-07T20:32:16.4682417Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.4682769Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.4683034Z E ^ 2025-05-07T20:32:16.4683574Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.4684018Z 2025-05-07T20:32:16.4684430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.4684942Z 2025-05-07T20:32:16.4685050Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.4685465Z self=, 2025-05-07T20:32:16.4685866Z T=4096, 2025-05-07T20:32:16.4686053Z D=5120, 2025-05-07T20:32:16.4686254Z scale_ub=1200.0, 2025-05-07T20:32:16.4686482Z contiguous=False, 2025-05-07T20:32:16.4686704Z compiled=True, 2025-05-07T20:32:16.4686908Z ) 2025-05-07T20:32:16.4687232Z self = 2025-05-07T20:32:16.4687719Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.4687999Z 2025-05-07T20:32:16.4688077Z @given( 2025-05-07T20:32:16.4688309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.4688624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.4688936Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.4689266Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.4689597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.4689932Z ) 2025-05-07T20:32:16.4690283Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.4690728Z def test_silu_mul_quant( 2025-05-07T20:32:16.4690973Z self, 2025-05-07T20:32:16.4691167Z T: int, 2025-05-07T20:32:16.4691367Z D: int, 2025-05-07T20:32:16.4691581Z scale_ub: Optional[float], 2025-05-07T20:32:16.4691857Z contiguous: bool, 2025-05-07T20:32:16.4692102Z compiled: bool, 2025-05-07T20:32:16.4692322Z ) -> None: 2025-05-07T20:32:16.4692546Z torch.manual_seed(2025) 2025-05-07T20:32:16.4692789Z 2025-05-07T20:32:16.4693061Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.4693405Z 2025-05-07T20:32:16.4693600Z x_sign = torch.sign(x) 2025-05-07T20:32:16.4693991Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.4694308Z x = x_sign * x_clamp 2025-05-07T20:32:16.4694548Z x0 = x[:, :D] 2025-05-07T20:32:16.4694766Z x1 = x[:, D:] 2025-05-07T20:32:16.4694971Z 2025-05-07T20:32:16.4695156Z if contiguous: 2025-05-07T20:32:16.4695386Z x0 = x0.contiguous() 2025-05-07T20:32:16.4695637Z x1 = x1.contiguous() 2025-05-07T20:32:16.4695877Z 2025-05-07T20:32:16.4696072Z if scale_ub is not None: 2025-05-07T20:32:16.4696393Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.4696729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.4697037Z ) 2025-05-07T20:32:16.4697227Z else: 2025-05-07T20:32:16.4697436Z scale_ub_tensor = None 2025-05-07T20:32:16.4697692Z 2025-05-07T20:32:16.4697937Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.4706204Z op = silu_mul_quant 2025-05-07T20:32:16.4706471Z if compiled: 2025-05-07T20:32:16.4706714Z op = torch.compile(op) 2025-05-07T20:32:16.4707142Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.4707420Z 2025-05-07T20:32:16.4707605Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.4707776Z 2025-05-07T20:32:16.4707871Z moe/activation_test.py:117: 2025-05-07T20:32:16.4708167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.4708487Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.4708772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.4709331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.4709889Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.4710680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.4711365Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.4711897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.4712567Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.4713228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.4713755Z kernel = self.compile( 2025-05-07T20:32:16.4714295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.4714939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.4715339Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.4715577Z 2025-05-07T20:32:16.4715788Z self = 2025-05-07T20:32:16.4716856Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.4718208Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5259b20>} 2025-05-07T20:32:16.4719538Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.4720551Z context = 2025-05-07T20:32:16.4720835Z 2025-05-07T20:32:16.4721012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.4721525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.4721980Z module_map=module_map) 2025-05-07T20:32:16.4722345Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.4722697Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.4722947Z E ^ 2025-05-07T20:32:16.4723399Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.4723839Z 2025-05-07T20:32:16.4724254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.4724835Z 2025-05-07T20:32:16.4724946Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.4725343Z self=, 2025-05-07T20:32:16.4725741Z T=2048, 2025-05-07T20:32:16.4725931Z D=7168, 2025-05-07T20:32:16.4726121Z scale_ub=1200.0, 2025-05-07T20:32:16.4726345Z contiguous=False, 2025-05-07T20:32:16.4726569Z compiled=False, 2025-05-07T20:32:16.6687422Z ) 2025-05-07T20:32:16.6688012Z self = 2025-05-07T20:32:16.6688971Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.6689259Z 2025-05-07T20:32:16.6689342Z @given( 2025-05-07T20:32:16.6689598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6689949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6690262Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6690606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6690935Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6691229Z ) 2025-05-07T20:32:16.6691585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6692201Z def test_silu_mul_quant( 2025-05-07T20:32:16.6692446Z self, 2025-05-07T20:32:16.6692648Z T: int, 2025-05-07T20:32:16.6692849Z D: int, 2025-05-07T20:32:16.6693065Z scale_ub: Optional[float], 2025-05-07T20:32:16.6693347Z contiguous: bool, 2025-05-07T20:32:16.6693591Z compiled: bool, 2025-05-07T20:32:16.6693956Z ) -> None: 2025-05-07T20:32:16.6694177Z torch.manual_seed(2025) 2025-05-07T20:32:16.6694422Z 2025-05-07T20:32:16.6694695Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6695046Z 2025-05-07T20:32:16.6695246Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6695536Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6695849Z x = x_sign * x_clamp 2025-05-07T20:32:16.6696095Z x0 = x[:, :D] 2025-05-07T20:32:16.6696313Z x1 = x[:, D:] 2025-05-07T20:32:16.6696527Z 2025-05-07T20:32:16.6696716Z if contiguous: 2025-05-07T20:32:16.6696955Z x0 = x0.contiguous() 2025-05-07T20:32:16.6697216Z x1 = x1.contiguous() 2025-05-07T20:32:16.6697458Z 2025-05-07T20:32:16.6697656Z if scale_ub is not None: 2025-05-07T20:32:16.6697934Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6698539Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6698853Z ) 2025-05-07T20:32:16.6699051Z else: 2025-05-07T20:32:16.6699270Z scale_ub_tensor = None 2025-05-07T20:32:16.6699532Z 2025-05-07T20:32:16.6699760Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6700073Z op = silu_mul_quant 2025-05-07T20:32:16.6700324Z if compiled: 2025-05-07T20:32:16.6700568Z op = torch.compile(op) 2025-05-07T20:32:16.6700864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6701143Z 2025-05-07T20:32:16.6701327Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6701495Z 2025-05-07T20:32:16.6701599Z moe/activation_test.py:117: 2025-05-07T20:32:16.6701895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6702226Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6702504Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6703187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:16.6703867Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6704392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6705066Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6705817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6706347Z kernel = self.compile( 2025-05-07T20:32:16.6706883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6707531Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6707928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6708216Z 2025-05-07T20:32:16.6708429Z self = 2025-05-07T20:32:16.6709491Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6710862Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda525a700>} 2025-05-07T20:32:16.6712299Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6713305Z context = 2025-05-07T20:32:16.6713593Z 2025-05-07T20:32:16.6713756Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6714269Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6714729Z module_map=module_map) 2025-05-07T20:32:16.6715093Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6715438Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6715701Z E ^ 2025-05-07T20:32:16.6716158Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6716595Z 2025-05-07T20:32:16.6717010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6717523Z 2025-05-07T20:32:16.6717624Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6718030Z self=, 2025-05-07T20:32:16.6718431Z T=1, 2025-05-07T20:32:16.6718610Z D=7168, 2025-05-07T20:32:16.6718804Z scale_ub=None, 2025-05-07T20:32:16.6719019Z contiguous=True, 2025-05-07T20:32:16.6719239Z compiled=False, 2025-05-07T20:32:16.6719441Z ) 2025-05-07T20:32:16.6719757Z self = 2025-05-07T20:32:16.6720232Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.6720498Z 2025-05-07T20:32:16.6720576Z @given( 2025-05-07T20:32:16.6720810Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6721128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6721433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6721760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6722086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6722364Z ) 2025-05-07T20:32:16.6722717Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6723164Z def test_silu_mul_quant( 2025-05-07T20:32:16.6723401Z self, 2025-05-07T20:32:16.6723600Z T: int, 2025-05-07T20:32:16.6723797Z D: int, 2025-05-07T20:32:16.6724008Z scale_ub: Optional[float], 2025-05-07T20:32:16.6724281Z contiguous: bool, 2025-05-07T20:32:16.6724521Z compiled: bool, 2025-05-07T20:32:16.6724793Z ) -> None: 2025-05-07T20:32:16.6725010Z torch.manual_seed(2025) 2025-05-07T20:32:16.6725248Z 2025-05-07T20:32:16.6725520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6725859Z 2025-05-07T20:32:16.6726055Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6726351Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6726657Z x = x_sign * x_clamp 2025-05-07T20:32:16.6726895Z x0 = x[:, :D] 2025-05-07T20:32:16.6727111Z x1 = x[:, D:] 2025-05-07T20:32:16.6727361Z 2025-05-07T20:32:16.6727544Z if contiguous: 2025-05-07T20:32:16.6727774Z x0 = x0.contiguous() 2025-05-07T20:32:16.6728021Z x1 = x1.contiguous() 2025-05-07T20:32:16.6728264Z 2025-05-07T20:32:16.6728454Z if scale_ub is not None: 2025-05-07T20:32:16.6728718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6729050Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6729360Z ) 2025-05-07T20:32:16.6729548Z else: 2025-05-07T20:32:16.6729756Z scale_ub_tensor = None 2025-05-07T20:32:16.6730005Z 2025-05-07T20:32:16.6730228Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6730619Z op = silu_mul_quant 2025-05-07T20:32:16.6730865Z if compiled: 2025-05-07T20:32:16.6731114Z op = torch.compile(op) 2025-05-07T20:32:16.6731403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6731681Z 2025-05-07T20:32:16.6731867Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6732031Z 2025-05-07T20:32:16.6732129Z moe/activation_test.py:117: 2025-05-07T20:32:16.6732421Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6732748Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6733024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6733861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6734546Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6735078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6735747Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6736405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6736936Z kernel = self.compile( 2025-05-07T20:32:16.6737466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6738117Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6738515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6738739Z 2025-05-07T20:32:16.6738944Z self = 2025-05-07T20:32:16.6740028Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6741376Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda525ba60>} 2025-05-07T20:32:16.6742705Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6743704Z context = 2025-05-07T20:32:16.6743992Z 2025-05-07T20:32:16.6744156Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6744722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6745185Z module_map=module_map) 2025-05-07T20:32:16.6745542Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6745897Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6746157Z E ^ 2025-05-07T20:32:16.6746610Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6747101Z 2025-05-07T20:32:16.6747508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6748015Z 2025-05-07T20:32:16.6748118Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6748529Z self=, 2025-05-07T20:32:16.6748921Z T=16384, 2025-05-07T20:32:16.6749114Z D=7168, 2025-05-07T20:32:16.6749311Z scale_ub=1200.0, 2025-05-07T20:32:16.6749530Z contiguous=False, 2025-05-07T20:32:16.6749752Z compiled=True, 2025-05-07T20:32:16.6749951Z ) 2025-05-07T20:32:16.6750258Z self = 2025-05-07T20:32:16.6750857Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.6751135Z 2025-05-07T20:32:16.6751212Z @given( 2025-05-07T20:32:16.6751440Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6751749Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6752053Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6752378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6752697Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6752981Z ) 2025-05-07T20:32:16.6753328Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6753760Z def test_silu_mul_quant( 2025-05-07T20:32:16.6754006Z self, 2025-05-07T20:32:16.6754198Z T: int, 2025-05-07T20:32:16.6754389Z D: int, 2025-05-07T20:32:16.6754610Z scale_ub: Optional[float], 2025-05-07T20:32:16.6754877Z contiguous: bool, 2025-05-07T20:32:16.6755122Z compiled: bool, 2025-05-07T20:32:16.6755357Z ) -> None: 2025-05-07T20:32:16.6755572Z torch.manual_seed(2025) 2025-05-07T20:32:16.6755813Z 2025-05-07T20:32:16.6756073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6756418Z 2025-05-07T20:32:16.6756609Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6756891Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6757198Z x = x_sign * x_clamp 2025-05-07T20:32:16.6757437Z x0 = x[:, :D] 2025-05-07T20:32:16.6757658Z x1 = x[:, D:] 2025-05-07T20:32:16.6757857Z 2025-05-07T20:32:16.6758042Z if contiguous: 2025-05-07T20:32:16.6758272Z x0 = x0.contiguous() 2025-05-07T20:32:16.6758525Z x1 = x1.contiguous() 2025-05-07T20:32:16.6758766Z 2025-05-07T20:32:16.6758958Z if scale_ub is not None: 2025-05-07T20:32:16.6759228Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6759566Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6759869Z ) 2025-05-07T20:32:16.6760055Z else: 2025-05-07T20:32:16.6760268Z scale_ub_tensor = None 2025-05-07T20:32:16.6760520Z 2025-05-07T20:32:16.6760745Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6761056Z op = silu_mul_quant 2025-05-07T20:32:16.6761304Z if compiled: 2025-05-07T20:32:16.6761543Z op = torch.compile(op) 2025-05-07T20:32:16.6761839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6762111Z 2025-05-07T20:32:16.6762296Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6762463Z 2025-05-07T20:32:16.6762614Z moe/activation_test.py:117: 2025-05-07T20:32:16.6762907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6763231Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6763506Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6764064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6764615Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6765258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6765974Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6766503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6767173Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6767822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6768352Z kernel = self.compile( 2025-05-07T20:32:16.6768888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6769611Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6770002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6770233Z 2025-05-07T20:32:16.6770438Z self = 2025-05-07T20:32:16.6771501Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6772845Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda48e4d60>} 2025-05-07T20:32:16.6774225Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6775238Z context = 2025-05-07T20:32:16.6775531Z 2025-05-07T20:32:16.6775694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6776206Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6776658Z module_map=module_map) 2025-05-07T20:32:16.6777017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6777368Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6777619Z E ^ 2025-05-07T20:32:16.6778074Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6778519Z 2025-05-07T20:32:16.6778927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.8091776Z 2025-05-07T20:32:16.8092570Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.8093211Z self=, 2025-05-07T20:32:16.8093840Z T=1, 2025-05-07T20:32:16.8094095Z D=7168, 2025-05-07T20:32:16.8094366Z scale_ub=None, 2025-05-07T20:32:16.8094640Z contiguous=False, 2025-05-07T20:32:16.8094931Z compiled=False, 2025-05-07T20:32:16.8095194Z ) 2025-05-07T20:32:16.8095639Z self = 2025-05-07T20:32:16.8096182Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.8096440Z 2025-05-07T20:32:16.8096520Z @given( 2025-05-07T20:32:16.8096753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.8097373Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.8097671Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.8098000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.8098671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.8098963Z ) 2025-05-07T20:32:16.8099303Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.8099742Z def test_silu_mul_quant( 2025-05-07T20:32:16.8100094Z self, 2025-05-07T20:32:16.8100282Z T: int, 2025-05-07T20:32:16.8100478Z D: int, 2025-05-07T20:32:16.8100693Z scale_ub: Optional[float], 2025-05-07T20:32:16.8100959Z contiguous: bool, 2025-05-07T20:32:16.8101197Z compiled: bool, 2025-05-07T20:32:16.8101426Z ) -> None: 2025-05-07T20:32:16.8101635Z torch.manual_seed(2025) 2025-05-07T20:32:16.8101874Z 2025-05-07T20:32:16.8102149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.8102486Z 2025-05-07T20:32:16.8102681Z x_sign = torch.sign(x) 2025-05-07T20:32:16.8102971Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.8103453Z x = x_sign * x_clamp 2025-05-07T20:32:16.8103701Z x0 = x[:, :D] 2025-05-07T20:32:16.8103917Z x1 = x[:, D:] 2025-05-07T20:32:16.8104121Z 2025-05-07T20:32:16.8104306Z if contiguous: 2025-05-07T20:32:16.8104536Z x0 = x0.contiguous() 2025-05-07T20:32:16.8104800Z x1 = x1.contiguous() 2025-05-07T20:32:16.8105031Z 2025-05-07T20:32:16.8105226Z if scale_ub is not None: 2025-05-07T20:32:16.8105498Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.8105828Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.8106141Z ) 2025-05-07T20:32:16.8106332Z else: 2025-05-07T20:32:16.8106539Z scale_ub_tensor = None 2025-05-07T20:32:16.8106794Z 2025-05-07T20:32:16.8107027Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.8107333Z op = silu_mul_quant 2025-05-07T20:32:16.8107582Z if compiled: 2025-05-07T20:32:16.8107836Z op = torch.compile(op) 2025-05-07T20:32:16.8108129Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.8108408Z 2025-05-07T20:32:16.8108602Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.8108763Z 2025-05-07T20:32:16.8108868Z moe/activation_test.py:117: 2025-05-07T20:32:16.8109161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.8109491Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.8109807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.8110502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.8111184Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.8111728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.8112403Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.8113063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.8113596Z kernel = self.compile( 2025-05-07T20:32:16.8114135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.8114783Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.8115180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.8115410Z 2025-05-07T20:32:16.8115616Z self = 2025-05-07T20:32:16.8116675Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.8118627Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda48e5760>} 2025-05-07T20:32:16.8120000Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.8121055Z context = 2025-05-07T20:32:16.8121340Z 2025-05-07T20:32:16.8121509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.8122020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.8122485Z module_map=module_map) 2025-05-07T20:32:16.8122848Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.8123199Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.8123453Z E ^ 2025-05-07T20:32:16.8123996Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.8124437Z 2025-05-07T20:32:16.8124853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.8125361Z 2025-05-07T20:32:16.8125474Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.8125875Z self=, 2025-05-07T20:32:16.8126274Z T=2048, 2025-05-07T20:32:16.8126463Z D=7168, 2025-05-07T20:32:16.8126652Z scale_ub=None, 2025-05-07T20:32:16.8126872Z contiguous=False, 2025-05-07T20:32:16.8127099Z compiled=True, 2025-05-07T20:32:16.8127302Z ) 2025-05-07T20:32:16.8127617Z self = 2025-05-07T20:32:16.8128106Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.8128371Z 2025-05-07T20:32:16.8128458Z @given( 2025-05-07T20:32:16.8128686Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.8129016Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.8129433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.8129895Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.8130369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.8130722Z ) 2025-05-07T20:32:16.8139440Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.8139940Z def test_silu_mul_quant( 2025-05-07T20:32:16.8140171Z self, 2025-05-07T20:32:16.8140367Z T: int, 2025-05-07T20:32:16.8140566Z D: int, 2025-05-07T20:32:16.8140776Z scale_ub: Optional[float], 2025-05-07T20:32:16.8141042Z contiguous: bool, 2025-05-07T20:32:16.8141275Z compiled: bool, 2025-05-07T20:32:16.8141490Z ) -> None: 2025-05-07T20:32:16.8141706Z torch.manual_seed(2025) 2025-05-07T20:32:16.8141953Z 2025-05-07T20:32:16.8142216Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.8142560Z 2025-05-07T20:32:16.8142753Z x_sign = torch.sign(x) 2025-05-07T20:32:16.8143037Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.8143347Z x = x_sign * x_clamp 2025-05-07T20:32:16.8143585Z x0 = x[:, :D] 2025-05-07T20:32:16.8143799Z x1 = x[:, D:] 2025-05-07T20:32:16.8143999Z 2025-05-07T20:32:16.8144182Z if contiguous: 2025-05-07T20:32:16.8144409Z x0 = x0.contiguous() 2025-05-07T20:32:16.8144654Z x1 = x1.contiguous() 2025-05-07T20:32:16.8144893Z 2025-05-07T20:32:16.8145162Z if scale_ub is not None: 2025-05-07T20:32:16.8145428Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.8145761Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.8146067Z ) 2025-05-07T20:32:16.8146252Z else: 2025-05-07T20:32:16.8146469Z scale_ub_tensor = None 2025-05-07T20:32:16.8146720Z 2025-05-07T20:32:16.8146942Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.8147256Z op = silu_mul_quant 2025-05-07T20:32:16.8147555Z if compiled: 2025-05-07T20:32:16.8147793Z op = torch.compile(op) 2025-05-07T20:32:16.8148088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.8148363Z 2025-05-07T20:32:16.8148555Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.8148714Z 2025-05-07T20:32:16.8148810Z moe/activation_test.py:117: 2025-05-07T20:32:16.8149098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.8149421Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.8149690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.8150236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.8150890Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.8151538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.8152215Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.8152753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.8153416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.8154070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.8154594Z kernel = self.compile( 2025-05-07T20:32:16.8155125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.8155767Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.8156165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.8156391Z 2025-05-07T20:32:16.8156600Z self = 2025-05-07T20:32:16.8157654Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.8159003Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda48e6f20>} 2025-05-07T20:32:16.8160322Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.8161325Z context = 2025-05-07T20:32:16.8161604Z 2025-05-07T20:32:16.8161780Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.8162283Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.8162743Z module_map=module_map) 2025-05-07T20:32:16.8163101Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.8163443Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.8163703Z E ^ 2025-05-07T20:32:16.8164154Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.8164591Z 2025-05-07T20:32:16.8165003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.8165556Z 2025-05-07T20:32:16.8165657Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.8166059Z self=, 2025-05-07T20:32:16.8166461Z T=4096, 2025-05-07T20:32:16.8166644Z D=7168, 2025-05-07T20:32:16.8166841Z scale_ub=None, 2025-05-07T20:32:16.8167060Z contiguous=False, 2025-05-07T20:32:16.8167282Z compiled=True, 2025-05-07T20:32:17.0414745Z ) 2025-05-07T20:32:17.0415275Z self = 2025-05-07T20:32:17.0416023Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.0416398Z 2025-05-07T20:32:17.0416507Z @given( 2025-05-07T20:32:17.0416760Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0417072Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0417405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0417743Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0418080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0418360Z ) 2025-05-07T20:32:17.0419083Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0419531Z def test_silu_mul_quant( 2025-05-07T20:32:17.0419768Z self, 2025-05-07T20:32:17.0419966Z T: int, 2025-05-07T20:32:17.0420165Z D: int, 2025-05-07T20:32:17.0420385Z scale_ub: Optional[float], 2025-05-07T20:32:17.0420658Z contiguous: bool, 2025-05-07T20:32:17.0420896Z compiled: bool, 2025-05-07T20:32:17.0421119Z ) -> None: 2025-05-07T20:32:17.0421340Z torch.manual_seed(2025) 2025-05-07T20:32:17.0421586Z 2025-05-07T20:32:17.0421851Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0422191Z 2025-05-07T20:32:17.0422390Z x_sign = torch.sign(x) 2025-05-07T20:32:17.0422675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.0422989Z x = x_sign * x_clamp 2025-05-07T20:32:17.0423233Z x0 = x[:, :D] 2025-05-07T20:32:17.0423453Z x1 = x[:, D:] 2025-05-07T20:32:17.0423661Z 2025-05-07T20:32:17.0423849Z if contiguous: 2025-05-07T20:32:17.0424083Z x0 = x0.contiguous() 2025-05-07T20:32:17.0424340Z x1 = x1.contiguous() 2025-05-07T20:32:17.0424586Z 2025-05-07T20:32:17.0424782Z if scale_ub is not None: 2025-05-07T20:32:17.0425050Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.0425385Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.0425703Z ) 2025-05-07T20:32:17.0425894Z else: 2025-05-07T20:32:17.0426110Z scale_ub_tensor = None 2025-05-07T20:32:17.0426362Z 2025-05-07T20:32:17.0426590Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.0426908Z op = silu_mul_quant 2025-05-07T20:32:17.0427158Z if compiled: 2025-05-07T20:32:17.0427401Z op = torch.compile(op) 2025-05-07T20:32:17.0427700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.0427980Z 2025-05-07T20:32:17.0428173Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.0428341Z 2025-05-07T20:32:17.0428440Z moe/activation_test.py:117: 2025-05-07T20:32:17.0428738Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.0429071Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.0429351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.0429951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.0430527Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.0431176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.0431954Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.0432488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.0433168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.0433827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.0434356Z kernel = self.compile( 2025-05-07T20:32:17.0434988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.0435642Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.0436034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.0436267Z 2025-05-07T20:32:17.0436475Z self = 2025-05-07T20:32:17.0437545Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.0439001Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df00e0>} 2025-05-07T20:32:17.0440377Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.0441391Z context = 2025-05-07T20:32:17.0441682Z 2025-05-07T20:32:17.0441849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.0442366Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.0442826Z module_map=module_map) 2025-05-07T20:32:17.0443196Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.0443549Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.0443810Z E ^ 2025-05-07T20:32:17.0444276Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.0444722Z 2025-05-07T20:32:17.0445132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.0445641Z 2025-05-07T20:32:17.0445753Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0446159Z self=, 2025-05-07T20:32:17.0446558Z T=16384, 2025-05-07T20:32:17.0446755Z D=5120, 2025-05-07T20:32:17.0446950Z scale_ub=1200.0, 2025-05-07T20:32:17.0447180Z contiguous=False, 2025-05-07T20:32:17.0447411Z compiled=False, 2025-05-07T20:32:17.0447625Z ) 2025-05-07T20:32:17.0447938Z self = 2025-05-07T20:32:17.0448435Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.0448718Z 2025-05-07T20:32:17.0448802Z @given( 2025-05-07T20:32:17.0449030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0449344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0449655Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0449977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0450306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0450591Z ) 2025-05-07T20:32:17.0450930Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0451372Z def test_silu_mul_quant( 2025-05-07T20:32:17.0451616Z self, 2025-05-07T20:32:17.0451881Z T: int, 2025-05-07T20:32:17.0452081Z D: int, 2025-05-07T20:32:17.0452297Z scale_ub: Optional[float], 2025-05-07T20:32:17.0452569Z contiguous: bool, 2025-05-07T20:32:17.0452811Z compiled: bool, 2025-05-07T20:32:17.0453031Z ) -> None: 2025-05-07T20:32:17.0453259Z torch.manual_seed(2025) 2025-05-07T20:32:17.0453509Z 2025-05-07T20:32:17.0453908Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0454253Z 2025-05-07T20:32:17.0454502Z x_sign = torch.sign(x) 2025-05-07T20:32:17.0454789Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.0455102Z x = x_sign * x_clamp 2025-05-07T20:32:17.0455344Z x0 = x[:, :D] 2025-05-07T20:32:17.0455558Z x1 = x[:, D:] 2025-05-07T20:32:17.0455773Z 2025-05-07T20:32:17.0455961Z if contiguous: 2025-05-07T20:32:17.0456188Z x0 = x0.contiguous() 2025-05-07T20:32:17.0456453Z x1 = x1.contiguous() 2025-05-07T20:32:17.0456697Z 2025-05-07T20:32:17.0456879Z if scale_ub is not None: 2025-05-07T20:32:17.0457153Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.0457486Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.0457881Z ) 2025-05-07T20:32:17.0458075Z else: 2025-05-07T20:32:17.0458286Z scale_ub_tensor = None 2025-05-07T20:32:17.0458539Z 2025-05-07T20:32:17.0458764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.0459083Z op = silu_mul_quant 2025-05-07T20:32:17.0459332Z if compiled: 2025-05-07T20:32:17.0459589Z op = torch.compile(op) 2025-05-07T20:32:17.0459925Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.0460201Z 2025-05-07T20:32:17.0460392Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.0460562Z 2025-05-07T20:32:17.0460664Z moe/activation_test.py:117: 2025-05-07T20:32:17.0460961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.0461291Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.0461571Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.0462256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:17.0462939Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.0463467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.0464145Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.0464806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.0465346Z kernel = self.compile( 2025-05-07T20:32:17.0465880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.0466536Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.0466934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.0467161Z 2025-05-07T20:32:17.0467373Z self = 2025-05-07T20:32:17.0468437Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.0469814Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df0b80>} 2025-05-07T20:32:17.0471178Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.0472257Z context = 2025-05-07T20:32:17.0472542Z 2025-05-07T20:32:17.0472707Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.0473233Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.0473694Z module_map=module_map) 2025-05-07T20:32:17.0474062Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.0474452Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.0474711Z E ^ 2025-05-07T20:32:17.0475170Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.0475610Z 2025-05-07T20:32:17.0476019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.0476532Z 2025-05-07T20:32:17.0476636Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0477052Z self=, 2025-05-07T20:32:17.0477454Z T=16384, 2025-05-07T20:32:17.0477642Z D=5120, 2025-05-07T20:32:17.0477838Z scale_ub=1200.0, 2025-05-07T20:32:17.0478174Z contiguous=True, 2025-05-07T20:32:17.0478391Z compiled=True, 2025-05-07T20:32:17.0478595Z ) 2025-05-07T20:32:17.0478915Z self = 2025-05-07T20:32:17.0479405Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.0479678Z 2025-05-07T20:32:17.0479754Z @given( 2025-05-07T20:32:17.0479987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0480291Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0480597Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0480926Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0481255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0481534Z ) 2025-05-07T20:32:17.0481880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0482321Z def test_silu_mul_quant( 2025-05-07T20:32:17.0482563Z self, 2025-05-07T20:32:17.0482759Z T: int, 2025-05-07T20:32:17.0482957Z D: int, 2025-05-07T20:32:17.0483169Z scale_ub: Optional[float], 2025-05-07T20:32:17.0483441Z contiguous: bool, 2025-05-07T20:32:17.0483682Z compiled: bool, 2025-05-07T20:32:17.0483897Z ) -> None: 2025-05-07T20:32:17.0484113Z torch.manual_seed(2025) 2025-05-07T20:32:17.0484353Z 2025-05-07T20:32:17.0484617Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0484960Z 2025-05-07T20:32:17.0485153Z x_sign = torch.sign(x) 2025-05-07T20:32:17.0485445Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.0485749Z x = x_sign * x_clamp 2025-05-07T20:32:17.0485988Z x0 = x[:, :D] 2025-05-07T20:32:17.0486206Z x1 = x[:, D:] 2025-05-07T20:32:17.0486412Z 2025-05-07T20:32:17.0486601Z if contiguous: 2025-05-07T20:32:17.0486837Z x0 = x0.contiguous() 2025-05-07T20:32:17.0487096Z x1 = x1.contiguous() 2025-05-07T20:32:17.0487343Z 2025-05-07T20:32:17.0487539Z if scale_ub is not None: 2025-05-07T20:32:17.0487807Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.0488147Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.0488460Z ) 2025-05-07T20:32:17.0488647Z else: 2025-05-07T20:32:17.0488858Z scale_ub_tensor = None 2025-05-07T20:32:17.0489114Z 2025-05-07T20:32:17.0489342Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.0489661Z op = silu_mul_quant 2025-05-07T20:32:17.0489917Z if compiled: 2025-05-07T20:32:17.0490221Z op = torch.compile(op) 2025-05-07T20:32:17.0490515Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.0490789Z 2025-05-07T20:32:17.0490994Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.0491159Z 2025-05-07T20:32:17.0491255Z moe/activation_test.py:117: 2025-05-07T20:32:17.0491554Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.0491882Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.0492156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.0492753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.0493305Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.0494060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.0494734Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.0495272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.0495944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.0496676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.0497209Z kernel = self.compile( 2025-05-07T20:32:17.0497749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.0498706Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.0499098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.0499336Z 2025-05-07T20:32:17.0499542Z self = 2025-05-07T20:32:17.0500611Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.0501970Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df22a0>} 2025-05-07T20:32:17.0504474Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.0505485Z context = 2025-05-07T20:32:17.0505774Z 2025-05-07T20:32:17.0505941Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.0506458Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.0506914Z module_map=module_map) 2025-05-07T20:32:17.0507283Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.0507635Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.0507902Z E ^ 2025-05-07T20:32:17.0508362Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.0508809Z 2025-05-07T20:32:17.0509217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.2062218Z 2025-05-07T20:32:17.2062823Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.2064025Z self=, 2025-05-07T20:32:17.2065116Z T=16384, 2025-05-07T20:32:17.2065584Z D=5120, 2025-05-07T20:32:17.2065974Z scale_ub=None, 2025-05-07T20:32:17.2066447Z contiguous=False, 2025-05-07T20:32:17.2066889Z compiled=True, 2025-05-07T20:32:17.2067301Z ) 2025-05-07T20:32:17.2068300Z self = 2025-05-07T20:32:17.2069290Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.2069719Z 2025-05-07T20:32:17.2069798Z @given( 2025-05-07T20:32:17.2070046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.2070361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.2070663Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.2070999Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.2071422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.2071714Z ) 2025-05-07T20:32:17.2072064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.2072507Z def test_silu_mul_quant( 2025-05-07T20:32:17.2072746Z self, 2025-05-07T20:32:17.2072939Z T: int, 2025-05-07T20:32:17.2073138Z D: int, 2025-05-07T20:32:17.2073357Z scale_ub: Optional[float], 2025-05-07T20:32:17.2073629Z contiguous: bool, 2025-05-07T20:32:17.2073869Z compiled: bool, 2025-05-07T20:32:17.2074100Z ) -> None: 2025-05-07T20:32:17.2074313Z torch.manual_seed(2025) 2025-05-07T20:32:17.2074557Z 2025-05-07T20:32:17.2074971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.2075312Z 2025-05-07T20:32:17.2075514Z x_sign = torch.sign(x) 2025-05-07T20:32:17.2075806Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.2076117Z x = x_sign * x_clamp 2025-05-07T20:32:17.2076362Z x0 = x[:, :D] 2025-05-07T20:32:17.2076583Z x1 = x[:, D:] 2025-05-07T20:32:17.2076789Z 2025-05-07T20:32:17.2076978Z if contiguous: 2025-05-07T20:32:17.2077221Z x0 = x0.contiguous() 2025-05-07T20:32:17.2077477Z x1 = x1.contiguous() 2025-05-07T20:32:17.2077720Z 2025-05-07T20:32:17.2077916Z if scale_ub is not None: 2025-05-07T20:32:17.2078199Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.2078528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.2078841Z ) 2025-05-07T20:32:17.2079037Z else: 2025-05-07T20:32:17.2079246Z scale_ub_tensor = None 2025-05-07T20:32:17.2079526Z 2025-05-07T20:32:17.2079766Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.2080126Z op = silu_mul_quant 2025-05-07T20:32:17.2080390Z if compiled: 2025-05-07T20:32:17.2080773Z op = torch.compile(op) 2025-05-07T20:32:17.2081163Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.2081745Z 2025-05-07T20:32:17.2090284Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.2090488Z 2025-05-07T20:32:17.2090599Z moe/activation_test.py:117: 2025-05-07T20:32:17.2090910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.2091254Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.2091546Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.2092115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.2092682Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.2093343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.2094731Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.2095277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.2095965Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.2096627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.2097171Z kernel = self.compile( 2025-05-07T20:32:17.2097723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.2098755Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.2099156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.2099402Z 2025-05-07T20:32:17.2099613Z self = 2025-05-07T20:32:17.2100686Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.2102156Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df3060>} 2025-05-07T20:32:17.2103485Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.2104509Z context = 2025-05-07T20:32:17.2104804Z 2025-05-07T20:32:17.2105125Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.2105652Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.2106112Z module_map=module_map) 2025-05-07T20:32:17.2106487Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.2106853Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.2107116Z E ^ 2025-05-07T20:32:17.2107586Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.2108038Z 2025-05-07T20:32:17.2108452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.2108965Z 2025-05-07T20:32:17.2109080Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.2109492Z self=, 2025-05-07T20:32:17.2109911Z T=2048, 2025-05-07T20:32:17.2110116Z D=5120, 2025-05-07T20:32:17.2110326Z scale_ub=None, 2025-05-07T20:32:17.2110546Z contiguous=False, 2025-05-07T20:32:17.2110775Z compiled=True, 2025-05-07T20:32:17.2110982Z ) 2025-05-07T20:32:17.2111308Z self = 2025-05-07T20:32:17.2111805Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.2112075Z 2025-05-07T20:32:17.2112154Z @given( 2025-05-07T20:32:17.2112394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.2112717Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.2113021Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.2113360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.2113693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.2113980Z ) 2025-05-07T20:32:17.2114334Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.2114786Z def test_silu_mul_quant( 2025-05-07T20:32:17.2115034Z self, 2025-05-07T20:32:17.2115227Z T: int, 2025-05-07T20:32:17.2115427Z D: int, 2025-05-07T20:32:17.2115649Z scale_ub: Optional[float], 2025-05-07T20:32:17.2115926Z contiguous: bool, 2025-05-07T20:32:17.2116169Z compiled: bool, 2025-05-07T20:32:17.2116396Z ) -> None: 2025-05-07T20:32:17.2116609Z torch.manual_seed(2025) 2025-05-07T20:32:17.2116857Z 2025-05-07T20:32:17.2117135Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.2117474Z 2025-05-07T20:32:17.2117671Z x_sign = torch.sign(x) 2025-05-07T20:32:17.2118043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.2118353Z x = x_sign * x_clamp 2025-05-07T20:32:17.2118599Z x0 = x[:, :D] 2025-05-07T20:32:17.2118819Z x1 = x[:, D:] 2025-05-07T20:32:17.2119025Z 2025-05-07T20:32:17.2119215Z if contiguous: 2025-05-07T20:32:17.2119460Z x0 = x0.contiguous() 2025-05-07T20:32:17.2119727Z x1 = x1.contiguous() 2025-05-07T20:32:17.2119965Z 2025-05-07T20:32:17.2120164Z if scale_ub is not None: 2025-05-07T20:32:17.2120490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.2120815Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.2121124Z ) 2025-05-07T20:32:17.2121325Z else: 2025-05-07T20:32:17.2121535Z scale_ub_tensor = None 2025-05-07T20:32:17.2121792Z 2025-05-07T20:32:17.2122029Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.2122341Z op = silu_mul_quant 2025-05-07T20:32:17.2122604Z if compiled: 2025-05-07T20:32:17.2122859Z op = torch.compile(op) 2025-05-07T20:32:17.2123154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.2123438Z 2025-05-07T20:32:17.2123635Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.2123891Z 2025-05-07T20:32:17.2123994Z moe/activation_test.py:117: 2025-05-07T20:32:17.2124285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.2124621Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.2124911Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.2125462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.2126026Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.2126689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.2127370Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.2127911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.2128598Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.2129266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.2129797Z kernel = self.compile( 2025-05-07T20:32:17.2130350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.2131010Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.2131420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.2131649Z 2025-05-07T20:32:17.2131856Z self = 2025-05-07T20:32:17.2132927Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.2134421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda51507c0>} 2025-05-07T20:32:17.2135755Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.2136764Z context = 2025-05-07T20:32:17.2137058Z 2025-05-07T20:32:17.2137225Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.2137749Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.2138272Z module_map=module_map) 2025-05-07T20:32:17.2138633Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.2138991Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.2139258Z E ^ 2025-05-07T20:32:17.2139744Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.2140221Z 2025-05-07T20:32:17.2140630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5790322Z 2025-05-07T20:32:17.5790796Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5791454Z self=, 2025-05-07T20:32:17.5792002Z T=2048, 2025-05-07T20:32:17.5792310Z D=5120, 2025-05-07T20:32:17.5792533Z scale_ub=1200.0, 2025-05-07T20:32:17.5792757Z contiguous=False, 2025-05-07T20:32:17.5792985Z compiled=True, 2025-05-07T20:32:17.5793218Z ) 2025-05-07T20:32:17.5793535Z self = 2025-05-07T20:32:17.5794036Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.5794313Z 2025-05-07T20:32:17.5794754Z @given( 2025-05-07T20:32:17.5794995Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5795307Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5795615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5795953Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5796273Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5796563Z ) 2025-05-07T20:32:17.5796913Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5797348Z def test_silu_mul_quant( 2025-05-07T20:32:17.5797592Z self, 2025-05-07T20:32:17.5797788Z T: int, 2025-05-07T20:32:17.5797982Z D: int, 2025-05-07T20:32:17.5798510Z scale_ub: Optional[float], 2025-05-07T20:32:17.5798788Z contiguous: bool, 2025-05-07T20:32:17.5799031Z compiled: bool, 2025-05-07T20:32:17.5799256Z ) -> None: 2025-05-07T20:32:17.5799473Z torch.manual_seed(2025) 2025-05-07T20:32:17.5799730Z 2025-05-07T20:32:17.5799998Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5800345Z 2025-05-07T20:32:17.5800543Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5800832Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5801144Z x = x_sign * x_clamp 2025-05-07T20:32:17.5801387Z x0 = x[:, :D] 2025-05-07T20:32:17.5801597Z x1 = x[:, D:] 2025-05-07T20:32:17.5801808Z 2025-05-07T20:32:17.5801997Z if contiguous: 2025-05-07T20:32:17.5802228Z x0 = x0.contiguous() 2025-05-07T20:32:17.5802490Z x1 = x1.contiguous() 2025-05-07T20:32:17.5802732Z 2025-05-07T20:32:17.5802925Z if scale_ub is not None: 2025-05-07T20:32:17.5803201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5803541Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5803851Z ) 2025-05-07T20:32:17.5804046Z else: 2025-05-07T20:32:17.5804265Z scale_ub_tensor = None 2025-05-07T20:32:17.5804522Z 2025-05-07T20:32:17.5804749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5805073Z op = silu_mul_quant 2025-05-07T20:32:17.5805331Z if compiled: 2025-05-07T20:32:17.5805576Z op = torch.compile(op) 2025-05-07T20:32:17.5805877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5806158Z 2025-05-07T20:32:17.5806348Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5806520Z 2025-05-07T20:32:17.5806618Z moe/activation_test.py:117: 2025-05-07T20:32:17.5806929Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5807365Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5807642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5808203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5808765Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5809416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5810234Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5810780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5811462Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5812118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5812656Z kernel = self.compile( 2025-05-07T20:32:17.5813206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5813955Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5814496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5814737Z 2025-05-07T20:32:17.5814944Z self = 2025-05-07T20:32:17.5816013Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5817391Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5151580>} 2025-05-07T20:32:17.5818711Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5819726Z context = 2025-05-07T20:32:17.5820022Z 2025-05-07T20:32:17.5820187Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5820706Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5821167Z module_map=module_map) 2025-05-07T20:32:17.5821533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5821888Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5822144Z E ^ 2025-05-07T20:32:17.5822609Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5823052Z 2025-05-07T20:32:17.5823465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5823969Z 2025-05-07T20:32:17.5824080Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5824491Z self=, 2025-05-07T20:32:17.5824896Z T=4096, 2025-05-07T20:32:17.5825090Z D=5120, 2025-05-07T20:32:17.5825282Z scale_ub=1200.0, 2025-05-07T20:32:17.5825513Z contiguous=True, 2025-05-07T20:32:17.5825736Z compiled=True, 2025-05-07T20:32:17.5825938Z ) 2025-05-07T20:32:17.5826260Z self = 2025-05-07T20:32:17.5826753Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.5827025Z 2025-05-07T20:32:17.5827108Z @given( 2025-05-07T20:32:17.5827334Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5827648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5828017Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5828344Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5828676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5828968Z ) 2025-05-07T20:32:17.5829317Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5829759Z def test_silu_mul_quant( 2025-05-07T20:32:17.5830002Z self, 2025-05-07T20:32:17.5830200Z T: int, 2025-05-07T20:32:17.5830448Z D: int, 2025-05-07T20:32:17.5830671Z scale_ub: Optional[float], 2025-05-07T20:32:17.5830945Z contiguous: bool, 2025-05-07T20:32:17.5831181Z compiled: bool, 2025-05-07T20:32:17.5831407Z ) -> None: 2025-05-07T20:32:17.5831624Z torch.manual_seed(2025) 2025-05-07T20:32:17.5831860Z 2025-05-07T20:32:17.5832135Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5832478Z 2025-05-07T20:32:17.5832671Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5832964Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5833274Z x = x_sign * x_clamp 2025-05-07T20:32:17.5833510Z x0 = x[:, :D] 2025-05-07T20:32:17.5833732Z x1 = x[:, D:] 2025-05-07T20:32:17.5834060Z 2025-05-07T20:32:17.5834244Z if contiguous: 2025-05-07T20:32:17.5834482Z x0 = x0.contiguous() 2025-05-07T20:32:17.5834743Z x1 = x1.contiguous() 2025-05-07T20:32:17.5834976Z 2025-05-07T20:32:17.5835177Z if scale_ub is not None: 2025-05-07T20:32:17.5835450Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5835785Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5836090Z ) 2025-05-07T20:32:17.5836285Z else: 2025-05-07T20:32:17.5836496Z scale_ub_tensor = None 2025-05-07T20:32:17.5836743Z 2025-05-07T20:32:17.5836981Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5837300Z op = silu_mul_quant 2025-05-07T20:32:17.5837545Z if compiled: 2025-05-07T20:32:17.5837794Z op = torch.compile(op) 2025-05-07T20:32:17.5838096Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5838373Z 2025-05-07T20:32:17.5838574Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5838738Z 2025-05-07T20:32:17.5838842Z moe/activation_test.py:117: 2025-05-07T20:32:17.5839138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5839474Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5839759Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5840360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5840906Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5841556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5842240Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5842771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5843453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5844122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5844657Z kernel = self.compile( 2025-05-07T20:32:17.5845196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5845853Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5846255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5846479Z 2025-05-07T20:32:17.5846697Z self = 2025-05-07T20:32:17.5847812Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5849171Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5152840>} 2025-05-07T20:32:17.5850502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5851561Z context = 2025-05-07T20:32:17.5851849Z 2025-05-07T20:32:17.5852015Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5852542Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5853017Z module_map=module_map) 2025-05-07T20:32:17.5853387Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5853848Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5854202Z E ^ 2025-05-07T20:32:17.5854671Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5855114Z 2025-05-07T20:32:17.5855532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.7555019Z 2025-05-07T20:32:17.7555438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.7556108Z self=, 2025-05-07T20:32:17.7556663Z T=128, 2025-05-07T20:32:17.7556920Z D=5120, 2025-05-07T20:32:17.7557168Z scale_ub=1200.0, 2025-05-07T20:32:17.7557419Z contiguous=False, 2025-05-07T20:32:17.7557644Z compiled=True, 2025-05-07T20:32:17.7557848Z ) 2025-05-07T20:32:17.7558168Z self = 2025-05-07T20:32:17.7558669Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.7558942Z 2025-05-07T20:32:17.7559026Z @given( 2025-05-07T20:32:17.7559253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.7559571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.7559897Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.7560256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.7560582Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.7560869Z ) 2025-05-07T20:32:17.7561212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.7561656Z def test_silu_mul_quant( 2025-05-07T20:32:17.7561904Z self, 2025-05-07T20:32:17.7562102Z T: int, 2025-05-07T20:32:17.7562295Z D: int, 2025-05-07T20:32:17.7562515Z scale_ub: Optional[float], 2025-05-07T20:32:17.7562787Z contiguous: bool, 2025-05-07T20:32:17.7563020Z compiled: bool, 2025-05-07T20:32:17.7563247Z ) -> None: 2025-05-07T20:32:17.7563474Z torch.manual_seed(2025) 2025-05-07T20:32:17.7563708Z 2025-05-07T20:32:17.7563984Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.7564336Z 2025-05-07T20:32:17.7564529Z x_sign = torch.sign(x) 2025-05-07T20:32:17.7564822Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.7565138Z x = x_sign * x_clamp 2025-05-07T20:32:17.7565374Z x0 = x[:, :D] 2025-05-07T20:32:17.7565593Z x1 = x[:, D:] 2025-05-07T20:32:17.7565805Z 2025-05-07T20:32:17.7565986Z if contiguous: 2025-05-07T20:32:17.7566220Z x0 = x0.contiguous() 2025-05-07T20:32:17.7566758Z x1 = x1.contiguous() 2025-05-07T20:32:17.7566989Z 2025-05-07T20:32:17.7567187Z if scale_ub is not None: 2025-05-07T20:32:17.7567463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.7567800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.7568110Z ) 2025-05-07T20:32:17.7568307Z else: 2025-05-07T20:32:17.7568518Z scale_ub_tensor = None 2025-05-07T20:32:17.7568765Z 2025-05-07T20:32:17.7568996Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.7569414Z op = silu_mul_quant 2025-05-07T20:32:17.7569658Z if compiled: 2025-05-07T20:32:17.7569905Z op = torch.compile(op) 2025-05-07T20:32:17.7570204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.7570472Z 2025-05-07T20:32:17.7570667Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.7570832Z 2025-05-07T20:32:17.7570936Z moe/activation_test.py:117: 2025-05-07T20:32:17.7571231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.7571561Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.7571842Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.7572545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.7573098Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.7573922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.7574605Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.7575133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.7575807Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.7576467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.7577001Z kernel = self.compile( 2025-05-07T20:32:17.7577532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.7578193Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.7578589Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.7578818Z 2025-05-07T20:32:17.7579031Z self = 2025-05-07T20:32:17.7580138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.7581511Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda51534c0>} 2025-05-07T20:32:17.7582839Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.7583857Z context = 2025-05-07T20:32:17.7584141Z 2025-05-07T20:32:17.7584312Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.7584827Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.7585293Z module_map=module_map) 2025-05-07T20:32:17.7585654Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.7586018Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.7586287Z E ^ 2025-05-07T20:32:17.7586744Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.7587242Z 2025-05-07T20:32:17.7587660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.7588166Z 2025-05-07T20:32:17.7588269Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.7588686Z self=, 2025-05-07T20:32:17.7589092Z T=16384, 2025-05-07T20:32:17.7589280Z D=7168, 2025-05-07T20:32:17.7589652Z scale_ub=1200.0, 2025-05-07T20:32:17.7590200Z contiguous=True, 2025-05-07T20:32:17.7598809Z compiled=True, 2025-05-07T20:32:17.7599029Z ) 2025-05-07T20:32:17.7599362Z self = 2025-05-07T20:32:17.7599856Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.7600138Z 2025-05-07T20:32:17.7600219Z @given( 2025-05-07T20:32:17.7600459Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.7600787Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.7601102Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.7601441Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.7601769Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.7602249Z ) 2025-05-07T20:32:17.7602612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.7603069Z def test_silu_mul_quant( 2025-05-07T20:32:17.7603310Z self, 2025-05-07T20:32:17.7603518Z T: int, 2025-05-07T20:32:17.7603726Z D: int, 2025-05-07T20:32:17.7603944Z scale_ub: Optional[float], 2025-05-07T20:32:17.7604224Z contiguous: bool, 2025-05-07T20:32:17.7604469Z compiled: bool, 2025-05-07T20:32:17.7604694Z ) -> None: 2025-05-07T20:32:17.7604916Z torch.manual_seed(2025) 2025-05-07T20:32:17.7605164Z 2025-05-07T20:32:17.7605456Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.7605810Z 2025-05-07T20:32:17.7606012Z x_sign = torch.sign(x) 2025-05-07T20:32:17.7606309Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.7606618Z x = x_sign * x_clamp 2025-05-07T20:32:17.7606869Z x0 = x[:, :D] 2025-05-07T20:32:17.7607091Z x1 = x[:, D:] 2025-05-07T20:32:17.7607297Z 2025-05-07T20:32:17.7607488Z if contiguous: 2025-05-07T20:32:17.7607726Z x0 = x0.contiguous() 2025-05-07T20:32:17.7607986Z x1 = x1.contiguous() 2025-05-07T20:32:17.7608236Z 2025-05-07T20:32:17.7608436Z if scale_ub is not None: 2025-05-07T20:32:17.7608710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.7609047Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.7609360Z ) 2025-05-07T20:32:17.7609551Z else: 2025-05-07T20:32:17.7609769Z scale_ub_tensor = None 2025-05-07T20:32:17.7610057Z 2025-05-07T20:32:17.7610305Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.7610628Z op = silu_mul_quant 2025-05-07T20:32:17.7610882Z if compiled: 2025-05-07T20:32:17.7611133Z op = torch.compile(op) 2025-05-07T20:32:17.7611435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.7611716Z 2025-05-07T20:32:17.7611913Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.7612077Z 2025-05-07T20:32:17.7612176Z moe/activation_test.py:117: 2025-05-07T20:32:17.7612477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.7612815Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.7613097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.7613746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.7614306Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.7588166Z 
2025-05-07T20:32:17.7588269Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:17.7588686Z     self=,
2025-05-07T20:32:17.7589092Z     T=16384,
2025-05-07T20:32:17.7589280Z     D=7168,
2025-05-07T20:32:17.7589652Z     scale_ub=1200.0,
2025-05-07T20:32:17.7590200Z     contiguous=True,
2025-05-07T20:32:17.7598809Z     compiled=True,
2025-05-07T20:32:17.7599029Z )
2025-05-07T20:32:17.7599362Z self = 
2025-05-07T20:32:17.7599856Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:17.7600138Z 
2025-05-07T20:32:17.7600219Z @given(
2025-05-07T20:32:17.7600459Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:17.7600787Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:17.7601102Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:17.7601441Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:17.7601769Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:17.7602249Z )
2025-05-07T20:32:17.7602612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:17.7603069Z def test_silu_mul_quant(
2025-05-07T20:32:17.7603310Z     self,
2025-05-07T20:32:17.7603518Z     T: int,
2025-05-07T20:32:17.7603726Z     D: int,
2025-05-07T20:32:17.7603944Z     scale_ub: Optional[float],
2025-05-07T20:32:17.7604224Z     contiguous: bool,
2025-05-07T20:32:17.7604469Z     compiled: bool,
2025-05-07T20:32:17.7604694Z ) -> None:
2025-05-07T20:32:17.7604916Z     torch.manual_seed(2025)
2025-05-07T20:32:17.7605164Z 
2025-05-07T20:32:17.7605456Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:17.7605810Z 
2025-05-07T20:32:17.7606012Z     x_sign = torch.sign(x)
2025-05-07T20:32:17.7606309Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:17.7606618Z     x = x_sign * x_clamp
2025-05-07T20:32:17.7606869Z     x0 = x[:, :D]
2025-05-07T20:32:17.7607091Z     x1 = x[:, D:]
2025-05-07T20:32:17.7607297Z 
2025-05-07T20:32:17.7607488Z     if contiguous:
2025-05-07T20:32:17.7607726Z         x0 = x0.contiguous()
2025-05-07T20:32:17.7607986Z         x1 = x1.contiguous()
2025-05-07T20:32:17.7608236Z 
2025-05-07T20:32:17.7608436Z     if scale_ub is not None:
2025-05-07T20:32:17.7608710Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:17.7609047Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:17.7609360Z         )
2025-05-07T20:32:17.7609551Z     else:
2025-05-07T20:32:17.7609769Z         scale_ub_tensor = None
2025-05-07T20:32:17.7610057Z 
2025-05-07T20:32:17.7610305Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:17.7610628Z         op = silu_mul_quant
2025-05-07T20:32:17.7610882Z         if compiled:
2025-05-07T20:32:17.7611133Z             op = torch.compile(op)
2025-05-07T20:32:17.7611435Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.7611716Z 
2025-05-07T20:32:17.7611913Z >   y_fp8, y_scale = fn()
2025-05-07T20:32:17.7612077Z 
2025-05-07T20:32:17.7612176Z moe/activation_test.py:117: 
2025-05-07T20:32:17.7612477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.7612815Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.7613097Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.7613746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:17.7614306Z     return fn(*args, **kwargs)
2025-05-07T20:32:17.7614965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.7615717Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.7616257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:17.7616931Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:17.7617583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:17.7618187Z     kernel = self.compile(
2025-05-07T20:32:17.7618730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.7619382Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.7619779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.7620014Z 
2025-05-07T20:32:17.7620226Z self = 
2025-05-07T20:32:17.7621377Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:17.7622742Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4378c20>}
2025-05-07T20:32:17.7624076Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:17.7625081Z context = 
2025-05-07T20:32:17.7625373Z 
2025-05-07T20:32:17.7625540Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:17.7626061Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:17.7626516Z                           module_map=module_map)
2025-05-07T20:32:17.7626879Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.7627235Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.7627494Z E   ^
2025-05-07T20:32:17.7627943Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.7628391Z 
2025-05-07T20:32:17.7628801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.8781504Z 
2025-05-07T20:32:17.8781891Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:17.8782513Z     self=,
2025-05-07T20:32:17.8783108Z     T=16384,
2025-05-07T20:32:17.8783387Z     D=5120,
2025-05-07T20:32:17.8783675Z     scale_ub=1200.0,
2025-05-07T20:32:17.8783970Z     contiguous=True,
2025-05-07T20:32:17.8784270Z     compiled=False,
2025-05-07T20:32:17.8784533Z )
2025-05-07T20:32:17.8812002Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.8812360Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.8812621Z E   ^
2025-05-07T20:32:17.8813087Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.8813529Z 
2025-05-07T20:32:17.8814080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
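Hypothesis echoes each generated example before executing it, so any failing parameter set can be replayed without the property harness. A standalone repro sketch for the first example above, assuming silu_mul_quant is importable from the module named in the traceback; it mirrors the test body rather than any FBGEMM helper:

import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 16384, 7168  # parameters from the failing example above
torch.manual_seed(2025)
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
x0 = x[:, :D].contiguous()
x1 = x[:, D:].contiguous()
scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

# On SM 8.6 this raises the same CompilationError once the kernel compiles.
y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)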
2025-05-07T20:32:17.8814592Z 
2025-05-07T20:32:17.8814709Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:17.8815118Z     self=,
2025-05-07T20:32:17.8815520Z     T=1,
2025-05-07T20:32:17.8815836Z     D=7168,
2025-05-07T20:32:17.8816033Z     scale_ub=1200.0,
2025-05-07T20:32:17.8816262Z     contiguous=False,
2025-05-07T20:32:17.8816498Z     compiled=False,
2025-05-07T20:32:17.8816704Z )
2025-05-07T20:32:17.8842980Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.8843331Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.8843598Z E   ^
2025-05-07T20:32:17.8844062Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.8844506Z 
2025-05-07T20:32:17.8844923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.8845429Z 
2025-05-07T20:32:17.8845534Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:17.8845948Z     self=,
2025-05-07T20:32:17.8846352Z     T=4096,
2025-05-07T20:32:17.8846538Z     D=7168,
2025-05-07T20:32:17.8846735Z     scale_ub=1200.0,
2025-05-07T20:32:17.8846963Z     contiguous=False,
2025-05-07T20:32:17.8847185Z     compiled=True,
2025-05-07T20:32:18.0470790Z )
2025-05-07T20:32:18.0504121Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.0504481Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:18.0504737Z E   ^
2025-05-07T20:32:18.0505195Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.0505639Z 
2025-05-07T20:32:18.0506055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:18.0506561Z 
2025-05-07T20:32:18.0506676Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.0507080Z     self=,
2025-05-07T20:32:18.0507476Z     T=128,
2025-05-07T20:32:18.0507665Z     D=7168,
2025-05-07T20:32:18.0507883Z     scale_ub=1200.0,
2025-05-07T20:32:18.0508116Z     contiguous=False,
2025-05-07T20:32:18.0508343Z     compiled=True,
2025-05-07T20:32:18.0508545Z )
2025-05-07T20:32:18.0544083Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.0544496Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:18.0544758Z E   ^
2025-05-07T20:32:18.0545217Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.0545673Z 
2025-05-07T20:32:18.0546084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:18.0546587Z 
2025-05-07T20:32:18.0546692Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.0547154Z     self=,
2025-05-07T20:32:18.0547557Z     T=2048,
2025-05-07T20:32:18.0547748Z     D=7168,
2025-05-07T20:32:18.0547947Z     scale_ub=None,
2025-05-07T20:32:18.0548172Z     contiguous=True,
2025-05-07T20:32:18.0548393Z     compiled=True,
2025-05-07T20:32:18.1753513Z )
2025-05-07T20:32:18.1781989Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.1782345Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:18.1782608Z E   ^
2025-05-07T20:32:18.1783065Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.1783509Z 
2025-05-07T20:32:18.1783925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
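For orientation while reading these failures: judging from the test body, the op under test fuses a SiLU-gated product with FP8 quantization. A plain-PyTorch sketch of the presumed semantics follows; the rowwise scaling and the float8_e4m3fn target are assumptions inferred from the test's outputs, not FBGEMM's documented contract:

import torch

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # silu(x0) * x1, computed in float32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Rowwise absolute maximum, optionally clamped to the provided upper bound.
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub)
    # float8_e4m3fn ("fp8e4nv") has a maximum representable value of 448.0.
    y_scale = amax / torch.finfo(torch.float8_e4m3fn).max
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale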
2025-05-07T20:32:18.1784432Z 
2025-05-07T20:32:18.1784539Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.1784951Z     self=,
2025-05-07T20:32:18.1785352Z     T=16384,
2025-05-07T20:32:18.1785542Z     D=5120,
2025-05-07T20:32:18.1785747Z     scale_ub=None,
2025-05-07T20:32:18.1785970Z     contiguous=False,
2025-05-07T20:32:18.1786196Z     compiled=False,
2025-05-07T20:32:18.1786405Z )
2025-05-07T20:32:18.1793485Z >   x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:18.1795553Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:18.1797384Z 
2025-05-07T20:32:18.1797512Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:18.1797722Z 
2025-05-07T20:32:18.1797833Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.1798500Z     self=,
2025-05-07T20:32:18.1798908Z     T=4096,
2025-05-07T20:32:18.1799101Z     D=7168,
2025-05-07T20:32:18.1799293Z     scale_ub=1200.0,
2025-05-07T20:32:18.1799518Z     contiguous=True,
2025-05-07T20:32:18.1799743Z     compiled=True,
2025-05-07T20:32:18.1799943Z )
2025-05-07T20:32:18.1806902Z >   x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:18.1808867Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:18.1810800Z 
2025-05-07T20:32:18.1810920Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:18.1811135Z 
2025-05-07T20:32:18.1811238Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.1811711Z     self=,
2025-05-07T20:32:18.1812107Z     T=16384,
2025-05-07T20:32:18.1812306Z     D=7168,
2025-05-07T20:32:18.1812505Z     scale_ub=None,
2025-05-07T20:32:18.1812717Z     contiguous=False,
2025-05-07T20:32:18.1812952Z     compiled=False,
2025-05-07T20:32:18.1813159Z )
2025-05-07T20:32:18.1819439Z >   x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:18.1821450Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:18.1823274Z 
2025-05-07T20:32:18.1823391Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:18.3068444Z 
2025-05-07T20:32:18.3069057Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.3069696Z     self=,
2025-05-07T20:32:18.3070317Z     T=2048,
2025-05-07T20:32:18.3070582Z     D=7168,
2025-05-07T20:32:18.3070821Z     scale_ub=1200.0,
2025-05-07T20:32:18.3071080Z     contiguous=True,
2025-05-07T20:32:18.3071316Z     compiled=True,
2025-05-07T20:32:18.3071532Z )
2025-05-07T20:32:18.3078933Z >   x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:18.3081049Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:18.3082874Z 
2025-05-07T20:32:18.3082998Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:18.3083209Z 
2025-05-07T20:32:18.3083313Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.3083724Z     self=,
2025-05-07T20:32:18.3084127Z     T=2048,
2025-05-07T20:32:18.3084325Z     D=7168,
2025-05-07T20:32:18.3084518Z     scale_ub=None,
2025-05-07T20:32:18.3084736Z     contiguous=True,
2025-05-07T20:32:18.3084966Z     compiled=False,
2025-05-07T20:32:18.3085169Z )
2025-05-07T20:32:18.3091828Z >   x_sign = torch.sign(x)
2025-05-07T20:32:18.3093838Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:18.3095702Z 
2025-05-07T20:32:18.3095826Z moe/activation_test.py:94: OutOfMemoryError
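From this point the failures are cascading allocator exhaustion rather than anything specific to the kernel: roughly 22 GiB of bfloat16 inputs from earlier examples are still held by the process, so even a 56 MiB torch.randn no longer fits. Two mitigation sketches consistent with the allocator's own hint in the message; neither is necessarily what this CI job should adopt:

import gc
import os

# (1) Opt in to expandable segments, exactly as the error message suggests.
#     Must be set before the first CUDA allocation in the process, so in CI it
#     belongs in the job environment rather than in test code.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

def release_cuda_memory() -> None:
    # (2) Between Hypothesis examples, collect dead tensors and return cached
    #     blocks to the driver, e.g. from the test's tearDown.
    gc.collect()
    torch.cuda.empty_cache()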
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.3095702Z 2025-05-07T20:32:18.3095826Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:18.3096038Z 2025-05-07T20:32:18.3096148Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.3096553Z self=, 2025-05-07T20:32:18.3097000Z T=1, 2025-05-07T20:32:18.3097189Z D=7168, 2025-05-07T20:32:18.3097385Z scale_ub=1200.0, 2025-05-07T20:32:18.3097613Z contiguous=True, 2025-05-07T20:32:18.3097837Z compiled=False, 2025-05-07T20:32:18.3098038Z ) 2025-05-07T20:32:18.3098639Z self = 2025-05-07T20:32:18.3099124Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.3099389Z 2025-05-07T20:32:18.3099474Z @given( 2025-05-07T20:32:18.3099706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.3100021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.3100327Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.3100781Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.3101115Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.3101405Z ) 2025-05-07T20:32:18.3101751Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.3102198Z def test_silu_mul_quant( 2025-05-07T20:32:18.3102442Z self, 2025-05-07T20:32:18.3102635Z T: int, 2025-05-07T20:32:18.3102838Z D: int, 2025-05-07T20:32:18.3103060Z scale_ub: Optional[float], 2025-05-07T20:32:18.3103333Z contiguous: bool, 2025-05-07T20:32:18.3103570Z compiled: bool, 2025-05-07T20:32:18.3103798Z ) -> None: 2025-05-07T20:32:18.3104016Z torch.manual_seed(2025) 2025-05-07T20:32:18.3104259Z 2025-05-07T20:32:18.3104532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.3104880Z 2025-05-07T20:32:18.3105072Z x_sign = torch.sign(x) 2025-05-07T20:32:18.3105371Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.3105682Z x = x_sign * x_clamp 2025-05-07T20:32:18.3105924Z x0 = x[:, :D] 2025-05-07T20:32:18.3106148Z x1 = x[:, D:] 2025-05-07T20:32:18.3106366Z 2025-05-07T20:32:18.3106554Z if contiguous: 2025-05-07T20:32:18.3106792Z x0 = x0.contiguous() 2025-05-07T20:32:18.3107060Z x1 = x1.contiguous() 2025-05-07T20:32:18.3107299Z 2025-05-07T20:32:18.3107496Z if scale_ub is not None: 2025-05-07T20:32:18.3107776Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.3108110Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.3108424Z ) 2025-05-07T20:32:18.3108628Z else: 2025-05-07T20:32:18.3108845Z scale_ub_tensor = None 2025-05-07T20:32:18.3109095Z 2025-05-07T20:32:18.3109333Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.3109657Z op = silu_mul_quant 2025-05-07T20:32:18.3109915Z if compiled: 2025-05-07T20:32:18.3110191Z op = torch.compile(op) 2025-05-07T20:32:18.3110518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.3110793Z 2025-05-07T20:32:18.3110991Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.3111156Z 2025-05-07T20:32:18.3111261Z moe/activation_test.py:117: 2025-05-07T20:32:18.3111555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.3111888Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.3112171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.3112859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.3113618Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.3114157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.3114841Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.3115497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.3116034Z kernel = self.compile( 2025-05-07T20:32:18.3116640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.3117297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.3117694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.3117928Z 2025-05-07T20:32:18.3118138Z self = 2025-05-07T20:32:18.3119216Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.3120680Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e4b80>} 2025-05-07T20:32:18.3122004Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.3123022Z context = 2025-05-07T20:32:18.3123318Z 2025-05-07T20:32:18.3123484Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.3124002Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.3124466Z module_map=module_map) 2025-05-07T20:32:18.3124831Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.3125187Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.3125456Z E ^ 2025-05-07T20:32:18.3125913Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.3126358Z 2025-05-07T20:32:18.3126769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.3127283Z 2025-05-07T20:32:18.3127411Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.3127830Z self=, 2025-05-07T20:32:18.3128228Z T=128, 2025-05-07T20:32:18.3128426Z D=5120, 2025-05-07T20:32:18.3128630Z scale_ub=None, 2025-05-07T20:32:18.3128848Z contiguous=True, 2025-05-07T20:32:18.3129079Z compiled=False, 2025-05-07T20:32:18.3138029Z ) 2025-05-07T20:32:18.3138389Z self = 2025-05-07T20:32:18.3138882Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.3139155Z 2025-05-07T20:32:18.3139243Z @given( 2025-05-07T20:32:18.3139473Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.3139789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.3140103Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.3140478Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.3140806Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.3141093Z ) 2025-05-07T20:32:18.3141442Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.3141876Z def test_silu_mul_quant( 2025-05-07T20:32:18.3142118Z self, 2025-05-07T20:32:18.3142409Z T: int, 2025-05-07T20:32:18.3142605Z D: int, 2025-05-07T20:32:18.3142828Z scale_ub: Optional[float], 2025-05-07T20:32:18.3143106Z contiguous: bool, 2025-05-07T20:32:18.3143339Z compiled: bool, 2025-05-07T20:32:18.3143565Z ) -> None: 2025-05-07T20:32:18.3143788Z torch.manual_seed(2025) 2025-05-07T20:32:18.3144023Z 2025-05-07T20:32:18.3144298Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.3144641Z 2025-05-07T20:32:18.3144893Z x_sign = torch.sign(x) 2025-05-07T20:32:18.3145187Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.3145502Z x = x_sign * x_clamp 2025-05-07T20:32:18.3145745Z x0 = x[:, :D] 2025-05-07T20:32:18.3145953Z x1 = x[:, D:] 2025-05-07T20:32:18.3146163Z 2025-05-07T20:32:18.3146350Z if contiguous: 2025-05-07T20:32:18.3146577Z x0 = x0.contiguous() 2025-05-07T20:32:18.3146839Z x1 = x1.contiguous() 2025-05-07T20:32:18.3147087Z 2025-05-07T20:32:18.3147277Z if scale_ub is not None: 2025-05-07T20:32:18.3147555Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.3147882Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.3148280Z ) 2025-05-07T20:32:18.3148478Z else: 2025-05-07T20:32:18.3148683Z scale_ub_tensor = None 2025-05-07T20:32:18.3148937Z 2025-05-07T20:32:18.3149168Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.3149485Z op = silu_mul_quant 2025-05-07T20:32:18.3149731Z if compiled: 2025-05-07T20:32:18.3149970Z op = torch.compile(op) 2025-05-07T20:32:18.3150254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.3150526Z 2025-05-07T20:32:18.3150720Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.3150884Z 2025-05-07T20:32:18.3150989Z moe/activation_test.py:117: 2025-05-07T20:32:18.3151280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.3151611Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.3151889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.3152571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.3153252Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.3153786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.3154458Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.3155108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.3155637Z kernel = self.compile( 2025-05-07T20:32:18.3156173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.3156820Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.3157215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.3157445Z 2025-05-07T20:32:18.3157655Z self = 2025-05-07T20:32:18.3158724Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.3160095Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e5a80>} 2025-05-07T20:32:18.3161445Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.3162512Z context = 2025-05-07T20:32:18.3162794Z 2025-05-07T20:32:18.3162964Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.3163480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.3163937Z module_map=module_map) 2025-05-07T20:32:18.3164303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.3164697Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.3164949Z E ^ 2025-05-07T20:32:18.3165407Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.3165844Z 2025-05-07T20:32:18.3166258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.4287923Z 2025-05-07T20:32:18.4289150Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.4290056Z self=, 2025-05-07T20:32:18.4290472Z T=128, 2025-05-07T20:32:18.4290674Z D=7168, 2025-05-07T20:32:18.4290870Z scale_ub=None, 2025-05-07T20:32:18.4291441Z contiguous=True, 2025-05-07T20:32:18.4291672Z compiled=False, 2025-05-07T20:32:18.4291886Z ) 2025-05-07T20:32:18.4292204Z self = 2025-05-07T20:32:18.4292708Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.4292982Z 2025-05-07T20:32:18.4293060Z @given( 2025-05-07T20:32:18.4293297Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.4293608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.4294044Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.4294376Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.4294706Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.4294993Z ) 2025-05-07T20:32:18.4295342Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.4295780Z def test_silu_mul_quant( 2025-05-07T20:32:18.4296033Z self, 2025-05-07T20:32:18.4296233Z T: int, 2025-05-07T20:32:18.4296426Z D: int, 2025-05-07T20:32:18.4296645Z scale_ub: Optional[float], 2025-05-07T20:32:18.4296919Z contiguous: bool, 2025-05-07T20:32:18.4297161Z compiled: bool, 2025-05-07T20:32:18.4297382Z ) -> None: 2025-05-07T20:32:18.4297598Z torch.manual_seed(2025) 2025-05-07T20:32:18.4297836Z 2025-05-07T20:32:18.4298105Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.4298612Z 2025-05-07T20:32:18.4298806Z x_sign = torch.sign(x) 2025-05-07T20:32:18.4299091Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.4299405Z x = x_sign * x_clamp 2025-05-07T20:32:18.4299646Z x0 = x[:, :D] 2025-05-07T20:32:18.4299855Z x1 = x[:, D:] 2025-05-07T20:32:18.4300065Z 2025-05-07T20:32:18.4300250Z if contiguous: 2025-05-07T20:32:18.4300476Z x0 = x0.contiguous() 2025-05-07T20:32:18.4300738Z x1 = x1.contiguous() 2025-05-07T20:32:18.4300983Z 2025-05-07T20:32:18.4301169Z if scale_ub is not None: 2025-05-07T20:32:18.4301445Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.4301776Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.4302086Z ) 2025-05-07T20:32:18.4302274Z else: 2025-05-07T20:32:18.4302487Z scale_ub_tensor = None 2025-05-07T20:32:18.4302739Z 2025-05-07T20:32:18.4302965Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.4303283Z op = silu_mul_quant 2025-05-07T20:32:18.4303534Z if compiled: 2025-05-07T20:32:18.4303875Z op = torch.compile(op) 2025-05-07T20:32:18.4304173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.4304482Z 2025-05-07T20:32:18.4304676Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.4304839Z 2025-05-07T20:32:18.4304944Z moe/activation_test.py:117: 2025-05-07T20:32:18.4305242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.4305583Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.4305868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.4306645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.4307330Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.4307864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.4308543Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.4309197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.4309730Z kernel = self.compile( 2025-05-07T20:32:18.4310937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.4311599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.4311987Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.4312223Z 2025-05-07T20:32:18.4312431Z self = 2025-05-07T20:32:18.4313500Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.4314869Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e6980>} 2025-05-07T20:32:18.4316195Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.4317201Z context = 2025-05-07T20:32:18.4317489Z 2025-05-07T20:32:18.4317655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.4318166Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.4318621Z module_map=module_map) 2025-05-07T20:32:18.4318987Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.4319339Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.4319593Z E ^ 2025-05-07T20:32:18.4320052Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.4320496Z 2025-05-07T20:32:18.4320908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.4321412Z 2025-05-07T20:32:18.4321519Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.4321925Z self=, 2025-05-07T20:32:18.4322328Z T=2048, 2025-05-07T20:32:18.4322518Z D=7168, 2025-05-07T20:32:18.4322708Z scale_ub=1200.0, 2025-05-07T20:32:18.4322930Z contiguous=True, 2025-05-07T20:32:18.4323156Z compiled=False, 2025-05-07T20:32:18.4323359Z ) 2025-05-07T20:32:18.4323678Z self = 2025-05-07T20:32:18.4324165Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.4324483Z 2025-05-07T20:32:18.4324567Z @given( 2025-05-07T20:32:18.4324791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.4325102Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.4325408Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.4325738Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.4326063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.4326350Z ) 2025-05-07T20:32:18.4326691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.4327178Z def test_silu_mul_quant( 2025-05-07T20:32:18.4327422Z self, 2025-05-07T20:32:18.4327618Z T: int, 2025-05-07T20:32:18.4327811Z D: int, 2025-05-07T20:32:18.4328033Z scale_ub: Optional[float], 2025-05-07T20:32:18.4328301Z contiguous: bool, 2025-05-07T20:32:18.4328555Z compiled: bool, 2025-05-07T20:32:18.4328774Z ) -> None: 2025-05-07T20:32:18.4328998Z torch.manual_seed(2025) 2025-05-07T20:32:18.4329239Z 2025-05-07T20:32:18.4329514Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.4331661Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
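The CompilationError above is an architecture limit, not a bug in the test: Triton only lowers the fp8e4nv (e4m3) dtype on GPUs with compute capability 8.9 or newer, and on older parts such as the A10G (sm_86) it exposes only fp8e4b15 and fp8e5, exactly the supported list in the ValueError. A minimal gating sketch (the helper name supports_fp8e4nv is hypothetical, not part of activation_test.py):

import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (e4m3) lowering requires sm_89+ (Ada/Hopper);
    # sm_86 and older only get fp8e4b15 / fp8e5, as the error above reports.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

Skipping the test on such a check (e.g. with unittest.skipUnless) would avoid failing the whole Hypothesis run on unsupported hardware.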
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.4333483Z 2025-05-07T20:32:18.4333600Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.4333879Z 2025-05-07T20:32:18.4333985Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.4334393Z self=, 2025-05-07T20:32:18.4334786Z T=1, 2025-05-07T20:32:18.4334974Z D=5120, 2025-05-07T20:32:18.4335170Z scale_ub=1200.0, 2025-05-07T20:32:18.4335396Z contiguous=True, 2025-05-07T20:32:18.4335613Z compiled=False, 2025-05-07T20:32:18.4335821Z ) 2025-05-07T20:32:18.4336148Z self = 2025-05-07T20:32:18.4336621Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.4336892Z 2025-05-07T20:32:18.4336968Z @given( 2025-05-07T20:32:18.4337197Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.4337501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.4337803Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.4338129Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.4338447Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.4338734Z ) 2025-05-07T20:32:18.4339081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.4339523Z def test_silu_mul_quant( 2025-05-07T20:32:18.4339756Z self, 2025-05-07T20:32:18.4339951Z T: int, 2025-05-07T20:32:18.4340145Z D: int, 2025-05-07T20:32:18.4340360Z scale_ub: Optional[float], 2025-05-07T20:32:18.4340654Z contiguous: bool, 2025-05-07T20:32:18.4340920Z compiled: bool, 2025-05-07T20:32:18.4341138Z ) -> None: 2025-05-07T20:32:18.4341355Z torch.manual_seed(2025) 2025-05-07T20:32:18.4341598Z 2025-05-07T20:32:18.4341862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.4342202Z 2025-05-07T20:32:18.4342399Z x_sign = torch.sign(x) 2025-05-07T20:32:18.4342681Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.4342992Z x = x_sign * x_clamp 2025-05-07T20:32:18.4343237Z x0 = x[:, :D] 2025-05-07T20:32:18.4343500Z x1 = x[:, D:] 2025-05-07T20:32:18.4343715Z 2025-05-07T20:32:18.4343900Z if contiguous: 2025-05-07T20:32:18.4344124Z x0 = x0.contiguous() 2025-05-07T20:32:18.4344380Z x1 = x1.contiguous() 2025-05-07T20:32:18.4344625Z 2025-05-07T20:32:18.4344823Z if scale_ub is not None: 2025-05-07T20:32:18.4345093Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.4345425Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.4345736Z ) 2025-05-07T20:32:18.4345968Z else: 2025-05-07T20:32:18.4346178Z scale_ub_tensor = None 2025-05-07T20:32:18.4346427Z 2025-05-07T20:32:18.4346651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.4346962Z op = silu_mul_quant 2025-05-07T20:32:18.4347214Z if compiled: 2025-05-07T20:32:18.4347457Z op = torch.compile(op) 2025-05-07T20:32:18.4347751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.4348030Z 2025-05-07T20:32:18.4348217Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.4348388Z 2025-05-07T20:32:18.4348485Z moe/activation_test.py:117: 2025-05-07T20:32:18.4348778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.4349217Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.4349496Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.4350176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.4350897Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.4351438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.4352110Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.4352762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.4353294Z kernel = self.compile( 2025-05-07T20:32:18.4353824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.4354477Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.4354875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.4355101Z 2025-05-07T20:32:18.4355311Z self = 2025-05-07T20:32:18.4356376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.4357720Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e7e20>} 2025-05-07T20:32:18.4359045Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.4360055Z context = 2025-05-07T20:32:18.4360337Z 2025-05-07T20:32:18.4360501Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.4361021Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.4361484Z module_map=module_map) 2025-05-07T20:32:18.4361846Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.4362192Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.4362456Z E ^ 2025-05-07T20:32:18.4362911Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.4363399Z 2025-05-07T20:32:18.4363808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.5202034Z 2025-05-07T20:32:18.5202207Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5202661Z self=, 2025-05-07T20:32:18.5203087Z T=2048, 2025-05-07T20:32:18.5203376Z D=5120, 2025-05-07T20:32:18.5203589Z scale_ub=None, 2025-05-07T20:32:18.5203928Z contiguous=True, 2025-05-07T20:32:18.5204165Z compiled=False, 2025-05-07T20:32:18.5204379Z ) 2025-05-07T20:32:18.5204704Z self = 2025-05-07T20:32:18.5205208Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.5205488Z 2025-05-07T20:32:18.5205572Z @given( 2025-05-07T20:32:18.5205819Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5206142Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5206459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5206802Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5207133Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5207552Z ) 2025-05-07T20:32:18.5207920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5208364Z def test_silu_mul_quant( 2025-05-07T20:32:18.5208619Z self, 2025-05-07T20:32:18.5208830Z T: int, 2025-05-07T20:32:18.5209037Z D: int, 2025-05-07T20:32:18.5209260Z scale_ub: Optional[float], 2025-05-07T20:32:18.5209542Z contiguous: bool, 2025-05-07T20:32:18.5209791Z compiled: bool, 2025-05-07T20:32:18.5210025Z ) -> None: 2025-05-07T20:32:18.5210253Z torch.manual_seed(2025) 2025-05-07T20:32:18.5210507Z 2025-05-07T20:32:18.5210785Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5211139Z 2025-05-07T20:32:18.5211346Z > x_sign = torch.sign(x) 2025-05-07T20:32:18.5213261Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5215214Z 2025-05-07T20:32:18.5215338Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:18.5215562Z 2025-05-07T20:32:18.5215668Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5216088Z self=, 2025-05-07T20:32:18.5216504Z T=16384, 2025-05-07T20:32:18.5216703Z D=5120, 2025-05-07T20:32:18.5216907Z scale_ub=None, 2025-05-07T20:32:18.5217133Z contiguous=True, 2025-05-07T20:32:18.5217360Z compiled=False, 2025-05-07T20:32:18.5217578Z ) 2025-05-07T20:32:18.5217916Z self = 2025-05-07T20:32:18.5218409Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.5218693Z 2025-05-07T20:32:18.5218775Z @given( 2025-05-07T20:32:18.5219025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5219341Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5219664Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5220008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5220349Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5220638Z ) 2025-05-07T20:32:18.5220996Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5221517Z def test_silu_mul_quant( 2025-05-07T20:32:18.5221762Z self, 2025-05-07T20:32:18.5221964Z T: int, 2025-05-07T20:32:18.5222169Z D: int, 2025-05-07T20:32:18.5222392Z scale_ub: Optional[float], 2025-05-07T20:32:18.5222676Z contiguous: bool, 2025-05-07T20:32:18.5222924Z compiled: bool, 2025-05-07T20:32:18.5223150Z ) -> None: 2025-05-07T20:32:18.5223377Z torch.manual_seed(2025) 2025-05-07T20:32:18.5223674Z 2025-05-07T20:32:18.5223946Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5225954Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
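For orientation, the repeated test body shows what the op consumes: a [T, 2*D] bf16 tensor split into halves x0 and x1, with silu_mul_quant expected to return an fp8 tensor plus rowwise scales (y_fp8, y_scale). A plain-PyTorch sketch of the unquantized reference, assuming the SwiGLU-style pairing the op name suggests (silu_mul_ref is a hypothetical name, not the library's API):

import torch
import torch.nn.functional as F

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # bf16 reference for the fused kernel's pre-quantization output.
    return F.silu(x0) * x1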
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5227774Z 2025-05-07T20:32:18.5227969Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5228192Z 2025-05-07T20:32:18.5228298Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5228717Z self=, 2025-05-07T20:32:18.5229121Z T=4096, 2025-05-07T20:32:18.5229317Z D=5120, 2025-05-07T20:32:18.5229518Z scale_ub=None, 2025-05-07T20:32:18.5229733Z contiguous=True, 2025-05-07T20:32:18.5229964Z compiled=False, 2025-05-07T20:32:18.5230177Z ) 2025-05-07T20:32:18.5230498Z self = 2025-05-07T20:32:18.5230997Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.5231280Z 2025-05-07T20:32:18.5231363Z @given( 2025-05-07T20:32:18.5231604Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5231922Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5232242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5232580Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5232911Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5233206Z ) 2025-05-07T20:32:18.5233565Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5234020Z def test_silu_mul_quant( 2025-05-07T20:32:18.5234266Z self, 2025-05-07T20:32:18.5234473Z T: int, 2025-05-07T20:32:18.5234681Z D: int, 2025-05-07T20:32:18.5234903Z scale_ub: Optional[float], 2025-05-07T20:32:18.5235184Z contiguous: bool, 2025-05-07T20:32:18.5235438Z compiled: bool, 2025-05-07T20:32:18.5235664Z ) -> None: 2025-05-07T20:32:18.5235892Z torch.manual_seed(2025) 2025-05-07T20:32:18.5236146Z 2025-05-07T20:32:18.5236419Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5238422Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5240240Z 2025-05-07T20:32:18.5240361Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5240581Z 2025-05-07T20:32:18.5240688Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5241203Z self=, 2025-05-07T20:32:18.5241606Z T=2048, 2025-05-07T20:32:18.5250232Z D=5120, 2025-05-07T20:32:18.5250470Z scale_ub=None, 2025-05-07T20:32:18.5250720Z contiguous=False, 2025-05-07T20:32:18.5250991Z compiled=False, 2025-05-07T20:32:18.5251204Z ) 2025-05-07T20:32:18.5251526Z self = 2025-05-07T20:32:18.5252030Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:18.5252383Z 2025-05-07T20:32:18.5252475Z @given( 2025-05-07T20:32:18.5252711Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5253042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5253356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5253801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5254143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5254440Z ) 2025-05-07T20:32:18.5254795Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5255238Z def test_silu_mul_quant( 2025-05-07T20:32:18.5255486Z self, 2025-05-07T20:32:18.5255688Z T: int, 2025-05-07T20:32:18.5255969Z D: int, 2025-05-07T20:32:18.5256196Z scale_ub: Optional[float], 2025-05-07T20:32:18.5256476Z contiguous: bool, 2025-05-07T20:32:18.5256715Z compiled: bool, 2025-05-07T20:32:18.5256950Z ) -> None: 2025-05-07T20:32:18.5257179Z torch.manual_seed(2025) 2025-05-07T20:32:18.5257418Z 2025-05-07T20:32:18.5257698Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5259719Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5261589Z 2025-05-07T20:32:18.5261712Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5261925Z 2025-05-07T20:32:18.5262038Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5263859Z self=, 2025-05-07T20:32:18.5264270Z T=4096, 2025-05-07T20:32:18.5264472Z D=7168, 2025-05-07T20:32:18.5264677Z scale_ub=None, 2025-05-07T20:32:18.5264890Z contiguous=True, 2025-05-07T20:32:18.5265123Z compiled=True, 2025-05-07T20:32:18.5265336Z ) 2025-05-07T20:32:18.5265650Z self = 2025-05-07T20:32:18.5266139Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:18.5266410Z 2025-05-07T20:32:18.5266490Z @given( 2025-05-07T20:32:18.5266728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5267040Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5267356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5267688Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5268022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5268309Z ) 2025-05-07T20:32:18.5268661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5269108Z def test_silu_mul_quant( 2025-05-07T20:32:18.5269346Z self, 2025-05-07T20:32:18.5269549Z T: int, 2025-05-07T20:32:18.5269753Z D: int, 2025-05-07T20:32:18.5269974Z scale_ub: Optional[float], 2025-05-07T20:32:18.5270248Z contiguous: bool, 2025-05-07T20:32:18.5270545Z compiled: bool, 2025-05-07T20:32:18.5270790Z ) -> None: 2025-05-07T20:32:18.5271032Z torch.manual_seed(2025) 2025-05-07T20:32:18.5271275Z 2025-05-07T20:32:18.5271541Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5273538Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5275404Z 2025-05-07T20:32:18.5275523Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5275742Z 2025-05-07T20:32:18.5275845Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5276258Z self=, 2025-05-07T20:32:18.5276652Z T=2048, 2025-05-07T20:32:18.5276842Z D=5120, 2025-05-07T20:32:18.5277040Z scale_ub=1200.0, 2025-05-07T20:32:18.5277337Z contiguous=False, 2025-05-07T20:32:18.5277566Z compiled=False, 2025-05-07T20:32:18.5822909Z ) 2025-05-07T20:32:18.5823780Z self = 2025-05-07T20:32:18.5824544Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:18.5824918Z 2025-05-07T20:32:18.5825025Z @given( 2025-05-07T20:32:18.5825339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5825750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5826071Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5826403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5826758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5827052Z ) 2025-05-07T20:32:18.5827400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5827847Z def test_silu_mul_quant( 2025-05-07T20:32:18.5828107Z self, 2025-05-07T20:32:18.5828297Z T: int, 2025-05-07T20:32:18.5828505Z D: int, 2025-05-07T20:32:18.5828728Z scale_ub: Optional[float], 2025-05-07T20:32:18.5829000Z contiguous: bool, 2025-05-07T20:32:18.5829251Z compiled: bool, 2025-05-07T20:32:18.5829489Z ) -> None: 2025-05-07T20:32:18.5829702Z torch.manual_seed(2025) 2025-05-07T20:32:18.5829949Z 2025-05-07T20:32:18.5830227Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5832252Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
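Every OOM record carries the same allocator hint. Note that PYTORCH_CUDA_ALLOC_CONF only takes effect if it is set before the first CUDA allocation, so it belongs in the job environment or at the very top of the test process; a minimal sketch, assuming a Python entry point:

import os

# Must be set before torch touches the CUDA caching allocator.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported afterwards so the allocator sees the setting

Here, though, the reserved-but-unallocated figures are tiny (roughly 4-60 MiB), so fragmentation is unlikely to be the real problem.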
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5834083Z 2025-05-07T20:32:18.5834209Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5834420Z 2025-05-07T20:32:18.5834526Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5834941Z self=, 2025-05-07T20:32:18.5835343Z T=4096, 2025-05-07T20:32:18.5835528Z D=7168, 2025-05-07T20:32:18.5835728Z scale_ub=1200.0, 2025-05-07T20:32:18.5835957Z contiguous=True, 2025-05-07T20:32:18.5836176Z compiled=False, 2025-05-07T20:32:18.5836392Z ) 2025-05-07T20:32:18.5836712Z self = 2025-05-07T20:32:18.5837513Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.5837784Z 2025-05-07T20:32:18.5837865Z @given( 2025-05-07T20:32:18.5838107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5838427Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5838726Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5839059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5839487Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5839761Z ) 2025-05-07T20:32:18.5840110Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5840549Z def test_silu_mul_quant( 2025-05-07T20:32:18.5840791Z self, 2025-05-07T20:32:18.5840980Z T: int, 2025-05-07T20:32:18.5841180Z D: int, 2025-05-07T20:32:18.5841400Z scale_ub: Optional[float], 2025-05-07T20:32:18.5841671Z contiguous: bool, 2025-05-07T20:32:18.5841913Z compiled: bool, 2025-05-07T20:32:18.5842141Z ) -> None: 2025-05-07T20:32:18.5842350Z torch.manual_seed(2025) 2025-05-07T20:32:18.5842594Z 2025-05-07T20:32:18.5843019Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5845015Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5846837Z 2025-05-07T20:32:18.5846959Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5847177Z 2025-05-07T20:32:18.5847279Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5847694Z self=, 2025-05-07T20:32:18.5848096Z T=16384, 2025-05-07T20:32:18.5848291Z D=7168, 2025-05-07T20:32:18.5848526Z scale_ub=None, 2025-05-07T20:32:18.5848740Z contiguous=False, 2025-05-07T20:32:18.5848969Z compiled=True, 2025-05-07T20:32:18.5849176Z ) 2025-05-07T20:32:18.5849488Z self = 2025-05-07T20:32:18.5849985Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:18.5850273Z 2025-05-07T20:32:18.5850370Z @given( 2025-05-07T20:32:18.5850624Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5850941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5851251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5851578Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5851910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5852201Z ) 2025-05-07T20:32:18.5852551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5852992Z def test_silu_mul_quant( 2025-05-07T20:32:18.5853240Z self, 2025-05-07T20:32:18.5853441Z T: int, 2025-05-07T20:32:18.5853633Z D: int, 2025-05-07T20:32:18.5853969Z scale_ub: Optional[float], 2025-05-07T20:32:18.5854248Z contiguous: bool, 2025-05-07T20:32:18.5854480Z compiled: bool, 2025-05-07T20:32:18.5854704Z ) -> None: 2025-05-07T20:32:18.5854917Z torch.manual_seed(2025) 2025-05-07T20:32:18.5855154Z 2025-05-07T20:32:18.5855426Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5857423Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5859286Z 2025-05-07T20:32:18.5859443Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5859655Z 2025-05-07T20:32:18.5859765Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5860178Z self=, 2025-05-07T20:32:18.5860617Z T=4096, 2025-05-07T20:32:18.5860809Z D=7168, 2025-05-07T20:32:18.5860998Z scale_ub=None, 2025-05-07T20:32:18.5861220Z contiguous=True, 2025-05-07T20:32:18.5861446Z compiled=False, 2025-05-07T20:32:18.5861651Z ) 2025-05-07T20:32:18.5861976Z self = 2025-05-07T20:32:18.5862466Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.5862732Z 2025-05-07T20:32:18.5862898Z @given( 2025-05-07T20:32:18.5863129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5863451Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5863761Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5864090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5864426Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5864730Z ) 2025-05-07T20:32:18.5865075Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5865526Z def test_silu_mul_quant( 2025-05-07T20:32:18.5865774Z self, 2025-05-07T20:32:18.5865965Z T: int, 2025-05-07T20:32:18.5866173Z D: int, 2025-05-07T20:32:18.5866402Z scale_ub: Optional[float], 2025-05-07T20:32:18.5866673Z contiguous: bool, 2025-05-07T20:32:18.5866925Z compiled: bool, 2025-05-07T20:32:18.5867156Z ) -> None: 2025-05-07T20:32:18.5867388Z torch.manual_seed(2025) 2025-05-07T20:32:18.5867649Z 2025-05-07T20:32:18.5867919Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5869917Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5871790Z 2025-05-07T20:32:18.5871910Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5872131Z 2025-05-07T20:32:18.5872238Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5872663Z self=, 2025-05-07T20:32:18.5873057Z T=16384, 2025-05-07T20:32:18.5873258Z D=7168, 2025-05-07T20:32:18.5873454Z scale_ub=None, 2025-05-07T20:32:18.5873673Z contiguous=True, 2025-05-07T20:32:18.5873905Z compiled=False, 2025-05-07T20:32:18.5874117Z ) 2025-05-07T20:32:18.5874431Z self = 2025-05-07T20:32:18.5874932Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.5875214Z 2025-05-07T20:32:18.5875293Z @given( 2025-05-07T20:32:18.5875531Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5875841Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5876243Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5876578Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5876903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5877198Z ) 2025-05-07T20:32:18.5877573Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5878189Z def test_silu_mul_quant( 2025-05-07T20:32:18.5878494Z self, 2025-05-07T20:32:18.5878698Z T: int, 2025-05-07T20:32:18.5878987Z D: int, 2025-05-07T20:32:18.5879207Z scale_ub: Optional[float], 2025-05-07T20:32:18.5879491Z contiguous: bool, 2025-05-07T20:32:18.5879735Z compiled: bool, 2025-05-07T20:32:18.5879962Z ) -> None: 2025-05-07T20:32:18.5880183Z torch.manual_seed(2025) 2025-05-07T20:32:18.5880427Z 2025-05-07T20:32:18.5880695Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5882807Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5884656Z 2025-05-07T20:32:18.5884773Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5884983Z 2025-05-07T20:32:18.5885095Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5885508Z self=, 2025-05-07T20:32:18.5885907Z T=16384, 2025-05-07T20:32:18.5886105Z D=7168, 2025-05-07T20:32:18.5886303Z scale_ub=1200.0, 2025-05-07T20:32:18.5886524Z contiguous=True, 2025-05-07T20:32:18.5886753Z compiled=False, 2025-05-07T20:32:18.5886966Z ) 2025-05-07T20:32:18.5887283Z self = 2025-05-07T20:32:18.5887787Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.5888063Z 2025-05-07T20:32:18.5888150Z @given( 2025-05-07T20:32:18.5888382Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5888704Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5889130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5889573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5889902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5890218Z ) 2025-05-07T20:32:18.5890599Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5891037Z def test_silu_mul_quant( 2025-05-07T20:32:18.5891293Z self, 2025-05-07T20:32:18.5891493Z T: int, 2025-05-07T20:32:18.5891689Z D: int, 2025-05-07T20:32:18.5891916Z scale_ub: Optional[float], 2025-05-07T20:32:18.5892194Z contiguous: bool, 2025-05-07T20:32:18.5892431Z compiled: bool, 2025-05-07T20:32:18.5892664Z ) -> None: 2025-05-07T20:32:18.5892887Z torch.manual_seed(2025) 2025-05-07T20:32:18.5893133Z 2025-05-07T20:32:18.5893409Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5895558Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
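Consistent with that, the "allocated by PyTorch" figure creeps upward across examples (21.69 GiB, then 21.73 GiB, later 21.77 GiB): tensors from earlier Hypothesis examples are apparently never released, so each new [T, 2*D] allocation finds the pool already full. A sketch of an explicit per-example cleanup (hypothetical; free_cuda_pool is an assumed helper, not in the original test):

import gc
import torch

def free_cuda_pool() -> None:
    gc.collect()               # drop lingering Python references first
    torch.cuda.synchronize()   # let in-flight kernels finish
    torch.cuda.empty_cache()   # hand cached blocks back to the driver

Calling this at the top of test_silu_mul_quant would give each generated (T, D) case a clean allocator pool.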
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5897463Z 2025-05-07T20:32:18.5897582Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.7706888Z 2025-05-07T20:32:18.7707631Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.7708316Z self=, 2025-05-07T20:32:18.7708849Z T=128, 2025-05-07T20:32:18.7709042Z D=5120, 2025-05-07T20:32:18.7709250Z scale_ub=1200.0, 2025-05-07T20:32:18.7709781Z contiguous=False, 2025-05-07T20:32:18.7710014Z compiled=False, 2025-05-07T20:32:18.7710225Z ) 2025-05-07T20:32:18.7710546Z self = 2025-05-07T20:32:18.7711044Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:18.7711316Z 2025-05-07T20:32:18.7711400Z @given( 2025-05-07T20:32:18.7711640Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.7711967Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.7712271Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.7712605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.7713093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.7713380Z ) 2025-05-07T20:32:18.7713733Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.7714188Z def test_silu_mul_quant( 2025-05-07T20:32:18.7714450Z self, 2025-05-07T20:32:18.7714648Z T: int, 2025-05-07T20:32:18.7714858Z D: int, 2025-05-07T20:32:18.7715083Z scale_ub: Optional[float], 2025-05-07T20:32:18.7715356Z contiguous: bool, 2025-05-07T20:32:18.7715602Z compiled: bool, 2025-05-07T20:32:18.7715845Z ) -> None: 2025-05-07T20:32:18.7716063Z torch.manual_seed(2025) 2025-05-07T20:32:18.7716319Z 2025-05-07T20:32:18.7716602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.7716943Z 2025-05-07T20:32:18.7717146Z x_sign = torch.sign(x) 2025-05-07T20:32:18.7717454Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.7717765Z x = x_sign * x_clamp 2025-05-07T20:32:18.7718025Z x0 = x[:, :D] 2025-05-07T20:32:18.7718261Z x1 = x[:, D:] 2025-05-07T20:32:18.7718472Z 2025-05-07T20:32:18.7718669Z if contiguous: 2025-05-07T20:32:18.7718907Z x0 = x0.contiguous() 2025-05-07T20:32:18.7719178Z x1 = x1.contiguous() 2025-05-07T20:32:18.7719418Z 2025-05-07T20:32:18.7719620Z if scale_ub is not None: 2025-05-07T20:32:18.7719902Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.7720251Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.7720603Z ) 2025-05-07T20:32:18.7720806Z else: 2025-05-07T20:32:18.7721019Z scale_ub_tensor = None 2025-05-07T20:32:18.7721282Z 2025-05-07T20:32:18.7721519Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.7721831Z op = silu_mul_quant 2025-05-07T20:32:18.7722085Z if compiled: 2025-05-07T20:32:18.7722341Z op = torch.compile(op) 2025-05-07T20:32:18.7722641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.7722926Z 2025-05-07T20:32:18.7723119Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.7723282Z 2025-05-07T20:32:18.7723388Z moe/activation_test.py:117: 2025-05-07T20:32:18.7723684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.7724022Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.7724311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.7724997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.7725688Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.7726329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.7727011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.7727672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.7728208Z kernel = self.compile( 2025-05-07T20:32:18.7728750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.7729444Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.7729848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.7730086Z 2025-05-07T20:32:18.7730296Z self = 2025-05-07T20:32:18.7731373Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.7732826Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc96fe8cae0>} 2025-05-07T20:32:18.7734296Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.7735317Z context = 2025-05-07T20:32:18.7735601Z 2025-05-07T20:32:18.7735775Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.7736301Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.7736763Z module_map=module_map) 2025-05-07T20:32:18.7737141Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.7737504Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.7737766Z E ^ 2025-05-07T20:32:18.7738236Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.7738681Z 2025-05-07T20:32:18.7739101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.7739612Z 2025-05-07T20:32:18.7739725Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.7740147Z self=, 2025-05-07T20:32:18.7740593Z T=2048, 2025-05-07T20:32:18.7740795Z D=7168, 2025-05-07T20:32:18.7740989Z scale_ub=None, 2025-05-07T20:32:18.7741212Z contiguous=False, 2025-05-07T20:32:18.7741445Z compiled=False, 2025-05-07T20:32:18.7741650Z ) 2025-05-07T20:32:18.7741974Z self = 2025-05-07T20:32:18.7742469Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:18.7742740Z 2025-05-07T20:32:18.7742827Z @given( 2025-05-07T20:32:18.7743064Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.7743379Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.7743687Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.7744013Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.7744348Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.7744636Z ) 2025-05-07T20:32:18.7744978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.7745420Z def test_silu_mul_quant( 2025-05-07T20:32:18.7745666Z self, 2025-05-07T20:32:18.7745860Z T: int, 2025-05-07T20:32:18.7746064Z D: int, 2025-05-07T20:32:18.7746338Z scale_ub: Optional[float], 2025-05-07T20:32:18.7746608Z contiguous: bool, 2025-05-07T20:32:18.7746853Z compiled: bool, 2025-05-07T20:32:18.7747080Z ) -> None: 2025-05-07T20:32:18.7747296Z torch.manual_seed(2025) 2025-05-07T20:32:18.7747545Z 2025-05-07T20:32:18.7747827Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.7749844Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.7751751Z 2025-05-07T20:32:18.7751878Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.7752090Z 2025-05-07T20:32:18.7752195Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.7752609Z self=, 2025-05-07T20:32:18.7753094Z T=128, 2025-05-07T20:32:18.7753279Z D=7168, 2025-05-07T20:32:18.7753482Z scale_ub=1200.0, 2025-05-07T20:32:18.7753709Z contiguous=True, 2025-05-07T20:32:18.7753936Z compiled=True, 2025-05-07T20:32:18.7754141Z ) 2025-05-07T20:32:18.7754480Z self = 2025-05-07T20:32:18.7754970Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:18.7755236Z 2025-05-07T20:32:18.7755316Z @given( 2025-05-07T20:32:18.7755555Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.7755879Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.7756179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.7756515Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.7756847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.7757141Z ) 2025-05-07T20:32:18.7766114Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.7766606Z def test_silu_mul_quant( 2025-05-07T20:32:18.7766861Z self, 2025-05-07T20:32:18.7767062Z T: int, 2025-05-07T20:32:18.7767273Z D: int, 2025-05-07T20:32:18.7767508Z scale_ub: Optional[float], 2025-05-07T20:32:18.7767785Z contiguous: bool, 2025-05-07T20:32:18.7768040Z compiled: bool, 2025-05-07T20:32:18.7768279Z ) -> None: 2025-05-07T20:32:18.7768500Z torch.manual_seed(2025) 2025-05-07T20:32:18.7768758Z 2025-05-07T20:32:18.7769046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.7769394Z 2025-05-07T20:32:18.7769603Z x_sign = torch.sign(x) 2025-05-07T20:32:18.7769913Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.7770266Z x = x_sign * x_clamp 2025-05-07T20:32:18.7770528Z x0 = x[:, :D] 2025-05-07T20:32:18.7770762Z x1 = x[:, D:] 2025-05-07T20:32:18.7770981Z 2025-05-07T20:32:18.7771177Z if contiguous: 2025-05-07T20:32:18.7771422Z x0 = x0.contiguous() 2025-05-07T20:32:18.7771693Z x1 = x1.contiguous() 2025-05-07T20:32:18.7771935Z 2025-05-07T20:32:18.7772143Z if scale_ub is not None: 2025-05-07T20:32:18.7772434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.7772773Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.7773097Z ) 2025-05-07T20:32:18.7773309Z else: 2025-05-07T20:32:18.7773525Z scale_ub_tensor = None 2025-05-07T20:32:18.7773891Z 2025-05-07T20:32:18.7774134Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.7774530Z op = silu_mul_quant 2025-05-07T20:32:18.7774785Z if compiled: 2025-05-07T20:32:18.7775032Z op = torch.compile(op) 2025-05-07T20:32:18.7775329Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.7775615Z 2025-05-07T20:32:18.7775820Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.7775985Z 2025-05-07T20:32:18.7776087Z moe/activation_test.py:117: 2025-05-07T20:32:18.7776388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.7776776Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.7777065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.7777621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:18.7778188Z return fn(*args, **kwargs) 
2025-05-07T20:32:18.7778850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.7779530Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.7780072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.7780861Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.7781531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.7782057Z kernel = self.compile( 2025-05-07T20:32:18.7782604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.7783262Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.7783657Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.7783894Z 2025-05-07T20:32:18.7784102Z self = 2025-05-07T20:32:18.7785175Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.7786537Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc96fd10040>} 2025-05-07T20:32:18.7787873Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.7788875Z context = 2025-05-07T20:32:18.7789165Z 2025-05-07T20:32:18.7789331Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.7789849Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.7790323Z module_map=module_map) 2025-05-07T20:32:18.7790686Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.7791044Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.7791308Z E ^ 2025-05-07T20:32:18.7791771Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.7792220Z 2025-05-07T20:32:18.7792632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.0865111Z 2025-05-07T20:32:19.0865491Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.0866126Z self=, 2025-05-07T20:32:19.0866681Z T=128, 2025-05-07T20:32:19.0866941Z D=7168, 2025-05-07T20:32:19.0867202Z scale_ub=1200.0, 2025-05-07T20:32:19.0867496Z contiguous=True, 2025-05-07T20:32:19.0868105Z compiled=False, 2025-05-07T20:32:19.0868370Z ) 2025-05-07T20:32:19.0868759Z self = 2025-05-07T20:32:19.0869256Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:19.0869526Z 2025-05-07T20:32:19.0869623Z @given( 2025-05-07T20:32:19.0869865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.0870182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.0870489Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.0870990Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.0871317Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.0871606Z ) 2025-05-07T20:32:19.0871949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.0872397Z def test_silu_mul_quant( 2025-05-07T20:32:19.0872642Z self, 2025-05-07T20:32:19.0872841Z T: int, 2025-05-07T20:32:19.0873043Z D: int, 2025-05-07T20:32:19.0873271Z scale_ub: Optional[float], 2025-05-07T20:32:19.0873543Z contiguous: bool, 2025-05-07T20:32:19.0873786Z compiled: bool, 2025-05-07T20:32:19.0874018Z ) -> None: 2025-05-07T20:32:19.0874394Z torch.manual_seed(2025) 2025-05-07T20:32:19.0874641Z 2025-05-07T20:32:19.0874918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.0875265Z 2025-05-07T20:32:19.0875460Z x_sign = torch.sign(x) 2025-05-07T20:32:19.0875755Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.0877727Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:19.0879554Z 2025-05-07T20:32:19.0879684Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:19.0879898Z 2025-05-07T20:32:19.0880002Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.0880434Z self=, 2025-05-07T20:32:19.0880872Z T=128, 2025-05-07T20:32:19.0881065Z D=5120, 2025-05-07T20:32:19.0881260Z scale_ub=1200.0, 2025-05-07T20:32:19.0881487Z contiguous=True, 2025-05-07T20:32:19.0881713Z compiled=True, 2025-05-07T20:32:19.0881913Z ) 2025-05-07T20:32:19.0882247Z self = 2025-05-07T20:32:19.0882731Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:19.0882996Z 2025-05-07T20:32:19.0883084Z @given( 2025-05-07T20:32:19.0883321Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.0883639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.0883947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.0884281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.0884612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.0884901Z ) 2025-05-07T20:32:19.0885245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.0885692Z def test_silu_mul_quant( 2025-05-07T20:32:19.0885939Z self, 2025-05-07T20:32:19.0886137Z T: int, 2025-05-07T20:32:19.0886340Z D: int, 2025-05-07T20:32:19.0886565Z scale_ub: Optional[float], 2025-05-07T20:32:19.0886842Z contiguous: bool, 2025-05-07T20:32:19.0887084Z compiled: bool, 2025-05-07T20:32:19.0887313Z ) -> None: 2025-05-07T20:32:19.0887585Z torch.manual_seed(2025) 2025-05-07T20:32:19.0887822Z 2025-05-07T20:32:19.0888094Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.0888437Z 2025-05-07T20:32:19.0888628Z x_sign = torch.sign(x) 2025-05-07T20:32:19.0888923Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.0890865Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:19.0892706Z 2025-05-07T20:32:19.0892832Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:19.0893044Z 2025-05-07T20:32:19.0893151Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.0893551Z self=, 2025-05-07T20:32:19.0894110Z T=128, 2025-05-07T20:32:19.0894379Z D=7168, 2025-05-07T20:32:19.0894584Z scale_ub=None, 2025-05-07T20:32:19.0894802Z contiguous=True, 2025-05-07T20:32:19.0895029Z compiled=True, 2025-05-07T20:32:19.0895226Z ) 2025-05-07T20:32:19.0895547Z self = 2025-05-07T20:32:19.0896040Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:19.0896303Z 2025-05-07T20:32:19.0896384Z @given( 2025-05-07T20:32:19.0896617Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.0896933Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.0897239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.0897574Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.0897906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.0898537Z ) 2025-05-07T20:32:19.0898917Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.0899369Z def test_silu_mul_quant( 2025-05-07T20:32:19.0899612Z self, 2025-05-07T20:32:19.0899805Z T: int, 2025-05-07T20:32:19.0900008Z D: int, 2025-05-07T20:32:19.0900232Z scale_ub: Optional[float], 2025-05-07T20:32:19.0900529Z contiguous: bool, 2025-05-07T20:32:19.0900798Z compiled: bool, 2025-05-07T20:32:19.0901027Z ) -> None: 2025-05-07T20:32:19.0901241Z torch.manual_seed(2025) 2025-05-07T20:32:19.0901484Z 2025-05-07T20:32:19.0901763Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.0903760Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:19.0905557Z 2025-05-07T20:32:19.0905682Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:19.0905894Z 2025-05-07T20:32:19.0912945Z FAILED 2025-05-07T20:32:19.0913105Z 2025-05-07T20:32:19.0913310Z =================================== FAILURES =================================== 2025-05-07T20:32:19.0913917Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:19.0914546Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:19.0915438Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:19.0916924Z | yield 2025-05-07T20:32:19.0917508Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:19.0918205Z | self._callTestMethod(testMethod) 2025-05-07T20:32:19.0918596Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:19.0919340Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:19.0920193Z | if method() is not None: 2025-05-07T20:32:19.0920588Z | ~~~~~~^^ 2025-05-07T20:32:19.0921449Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:19.0922427Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.0922838Z | ^^^^^^^ 2025-05-07T20:32:19.0923599Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:19.0924456Z | raise the_error_hypothesis_found 2025-05-07T20:32:19.0925026Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:19.0925736Z +-+---------------- 1 ---------------- 2025-05-07T20:32:19.0926139Z | Traceback (most recent call last): 2025-05-07T20:32:19.0927098Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:19.0928164Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.0931019Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:19.0933825Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:19.0934422Z | self=, 2025-05-07T20:32:19.0934975Z | T=2048, 2025-05-07T20:32:19.0935295Z | D=5120, # or any other generated value 2025-05-07T20:32:19.0935746Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:19.0936237Z | contiguous=True, # or any other generated value 2025-05-07T20:32:19.0936729Z | compiled=False, # or any other generated value 2025-05-07T20:32:19.0937143Z | ) 2025-05-07T20:32:19.0937390Z | 2025-05-07T20:32:19.0938108Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:19.0938936Z +---------------- 2 ---------------- 2025-05-07T20:32:19.0939325Z | Traceback (most recent call last): 2025-05-07T20:32:19.0940323Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:19.0941470Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.0944275Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:19.0947033Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:19.0947583Z | self=, 2025-05-07T20:32:19.0947999Z | T=128, 2025-05-07T20:32:19.0948209Z | D=7168, 2025-05-07T20:32:19.0948418Z | scale_ub=None, 2025-05-07T20:32:19.0948666Z | contiguous=True, 2025-05-07T20:32:19.0948974Z | compiled=True, 2025-05-07T20:32:19.0949196Z | ) 2025-05-07T20:32:19.0949382Z | 2025-05-07T20:32:19.0949907Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:19.0950511Z +---------------- 3 ---------------- 2025-05-07T20:32:19.0950803Z | Traceback (most recent call last): 2025-05-07T20:32:19.0951510Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:19.0952358Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.0954469Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:19.0956393Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:19.0956826Z | self=, 2025-05-07T20:32:19.0957239Z | T=128, 2025-05-07T20:32:19.0957444Z | D=5120, 2025-05-07T20:32:19.0957655Z | scale_ub=1200.0, 2025-05-07T20:32:19.0957905Z | contiguous=True, 2025-05-07T20:32:19.0958151Z | compiled=True, 2025-05-07T20:32:19.0958375Z | ) 2025-05-07T20:32:19.0958567Z | 2025-05-07T20:32:19.0959088Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:19.0959687Z +---------------- 4 ---------------- 2025-05-07T20:32:19.0959986Z | Traceback (most recent call last): 2025-05-07T20:32:19.0960691Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:19.0961405Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:19.0961694Z | ~~~~~~^^ 2025-05-07T20:32:19.0962334Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:19.0963031Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.0963862Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:19.0964644Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:19.0964942Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:32:19.0965215Z | a, 2025-05-07T20:32:19.0965415Z | ^^ 2025-05-07T20:32:19.0965631Z | ...<23 lines>... 
2025-05-07T20:32:19.0965879Z | USE_INT64=use_int64, 2025-05-07T20:32:19.0966144Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:19.0966393Z | ) 2025-05-07T20:32:19.0966585Z | ^ 2025-05-07T20:32:19.0967102Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda> 2025-05-07T20:32:19.0967892Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.0968348Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:19.0969002Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:19.0969770Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:19.0970307Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:19.0970984Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:19.0971678Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:19.0972058Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:19.0972667Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:19.0973232Z | fn() 2025-05-07T20:32:19.0973433Z | ~~^^ 2025-05-07T20:32:19.0974241Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:19.0974875Z | self.fn.run( 2025-05-07T20:32:19.0975102Z | ~~~~~~~~~~~^ 2025-05-07T20:32:19.0975322Z | *args, 2025-05-07T20:32:19.0975544Z | ^^^^^^ 2025-05-07T20:32:19.0975765Z | **current, 2025-05-07T20:32:19.0975991Z | ^^^^^^^^^^ 2025-05-07T20:32:19.0976220Z | ) 2025-05-07T20:32:19.0976419Z | ^ 2025-05-07T20:32:19.0976906Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:19.0977493Z | kernel = self.compile( 2025-05-07T20:32:19.0977756Z | src, 2025-05-07T20:32:19.0977975Z | target=target, 2025-05-07T20:32:19.0978240Z | options=options.__dict__, 2025-05-07T20:32:19.0978516Z | ) 2025-05-07T20:32:19.0979063Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:19.0979759Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.0980465Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:19.0981245Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.0981711Z | module_map=module_map) 2025-05-07T20:32:19.0982082Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.0982439Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:19.0982712Z | ^ 2025-05-07T20:32:19.0983166Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.0983727Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:19.0984137Z | # The test always failed when commented parts were varied together.
2025-05-07T20:32:19.0984647Z | self=, 2025-05-07T20:32:19.0985087Z | T=1, # or any other generated value 2025-05-07T20:32:19.0985409Z | D=5120, # or any other generated value 2025-05-07T20:32:19.0985754Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:19.0986119Z | contiguous=True, # or any other generated value 2025-05-07T20:32:19.0986488Z | compiled=True, # or any other generated value 2025-05-07T20:32:19.0986795Z | ) 2025-05-07T20:32:19.0986976Z | 2025-05-07T20:32:19.0987568Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:19.0988174Z +------------------------------------ 2025-05-07T20:32:19.0988537Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:19.0988917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.0989332Z self=, 2025-05-07T20:32:19.0989732Z T=1, 2025-05-07T20:32:19.0989964Z D=5120, 2025-05-07T20:32:19.0990162Z scale_ub=None, 2025-05-07T20:32:19.0990428Z contiguous=True, 2025-05-07T20:32:19.0990673Z compiled=True, 2025-05-07T20:32:19.0990888Z ) 2025-05-07T20:32:19.0991214Z self = 2025-05-07T20:32:19.0991697Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:19.0991963Z 2025-05-07T20:32:19.0992049Z @given( 2025-05-07T20:32:19.0992292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.0992608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.0992920Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.0993346Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.0993780Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.0994145Z ) 2025-05-07T20:32:19.1015966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1016673Z def test_silu_mul_quant( 2025-05-07T20:32:19.1017017Z self, 2025-05-07T20:32:19.1017289Z T: int, 2025-05-07T20:32:19.1017559Z D: int, 2025-05-07T20:32:19.1017854Z scale_ub: Optional[float], 2025-05-07T20:32:19.1018242Z contiguous: bool, 2025-05-07T20:32:19.1018565Z compiled: bool, 2025-05-07T20:32:19.1018869Z ) -> None: 2025-05-07T20:32:19.1019155Z torch.manual_seed(2025) 2025-05-07T20:32:19.1019478Z 2025-05-07T20:32:19.1019848Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1020326Z 2025-05-07T20:32:19.1020596Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1020994Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1021423Z x = x_sign * x_clamp 2025-05-07T20:32:19.1021757Z x0 = x[:, :D] 2025-05-07T20:32:19.1022077Z x1 = x[:, D:] 2025-05-07T20:32:19.1022377Z 2025-05-07T20:32:19.1022631Z if contiguous: 2025-05-07T20:32:19.1022961Z x0 = x0.contiguous() 2025-05-07T20:32:19.1023320Z x1 = x1.contiguous() 2025-05-07T20:32:19.1023658Z 2025-05-07T20:32:19.1023918Z if scale_ub is not None: 2025-05-07T20:32:19.1024295Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1024757Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1025174Z ) 2025-05-07T20:32:19.1025446Z else: 2025-05-07T20:32:19.1025745Z scale_ub_tensor = None 2025-05-07T20:32:19.1026085Z 2025-05-07T20:32:19.1026406Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1026844Z op = silu_mul_quant 2025-05-07T20:32:19.1027186Z if compiled: 2025-05-07T20:32:19.1027541Z op = torch.compile(op) 2025-05-07T20:32:19.1027954Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1028328Z 2025-05-07T20:32:19.1028596Z 
y_fp8, y_scale = fn() 2025-05-07T20:32:19.1028991Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:19.1029388Z 2025-05-07T20:32:19.1029717Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1030168Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:19.1030548Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:19.1030950Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:19.1031410Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.1032086Z 2025-05-07T20:32:19.1032347Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:19.1032595Z 2025-05-07T20:32:19.1032730Z moe/activation_test.py:126: 2025-05-07T20:32:19.1033112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1033551Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:19.1033996Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.1035096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:19.1036254Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:19.1037011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1037935Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1038874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:19.1039858Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:19.1041019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:19.1041893Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:19.1042697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:19.1043406Z fn() 2025-05-07T20:32:19.1044098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:19.1044872Z self.fn.run( 2025-05-07T20:32:19.1045502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1046231Z kernel = self.compile( 2025-05-07T20:32:19.1046969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1047834Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1048395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1048721Z 2025-05-07T20:32:19.1049002Z self = 2025-05-07T20:32:19.1050464Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1052339Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fce612f36a0>} 2025-05-07T20:32:19.1054244Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1055580Z context = 2025-05-07T20:32:19.1055962Z 2025-05-07T20:32:19.1056186Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1056853Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1057472Z module_map=module_map) 2025-05-07T20:32:19.1057950Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1058402Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:19.1058747Z E ^ 2025-05-07T20:32:19.1059359Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1059953Z 2025-05-07T20:32:19.1060647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1061340Z 2025-05-07T20:32:19.1061483Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1062052Z self=, 2025-05-07T20:32:19.1062593Z T=2048, 2025-05-07T20:32:19.1062855Z D=5120, 2025-05-07T20:32:19.1063114Z scale_ub=1200.0, 2025-05-07T20:32:19.1063427Z contiguous=True, 2025-05-07T20:32:19.1063802Z compiled=False, 2025-05-07T20:32:19.1064085Z ) 2025-05-07T20:32:19.1064528Z self = 2025-05-07T20:32:19.1065214Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:19.1065593Z 2025-05-07T20:32:19.1065701Z @given( 2025-05-07T20:32:19.1066020Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1066451Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1066873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1067331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1067782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1068168Z ) 2025-05-07T20:32:19.1068720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1069306Z def test_silu_mul_quant( 2025-05-07T20:32:19.1069617Z self, 2025-05-07T20:32:19.1069866Z T: int, 2025-05-07T20:32:19.1070142Z D: int, 2025-05-07T20:32:19.1070435Z scale_ub: Optional[float], 2025-05-07T20:32:19.1070786Z contiguous: bool, 2025-05-07T20:32:19.1071110Z compiled: bool, 2025-05-07T20:32:19.1071419Z ) -> None: 2025-05-07T20:32:19.1071707Z torch.manual_seed(2025) 2025-05-07T20:32:19.1072037Z 2025-05-07T20:32:19.1072401Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1072866Z 2025-05-07T20:32:19.1073133Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1073527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1073939Z x = x_sign * x_clamp 2025-05-07T20:32:19.1074264Z x0 = x[:, :D] 2025-05-07T20:32:19.1074566Z x1 = x[:, D:] 2025-05-07T20:32:19.1074850Z 2025-05-07T20:32:19.1075095Z if contiguous: 2025-05-07T20:32:19.1075389Z x0 = x0.contiguous() 2025-05-07T20:32:19.1075713Z x1 = x1.contiguous() 2025-05-07T20:32:19.1076007Z 2025-05-07T20:32:19.1076248Z if scale_ub is not None: 2025-05-07T20:32:19.1076593Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1077009Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1077405Z ) 2025-05-07T20:32:19.1077653Z else: 2025-05-07T20:32:19.1077926Z scale_ub_tensor = None 2025-05-07T20:32:19.1078269Z 2025-05-07T20:32:19.1078579Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1079010Z op = silu_mul_quant 2025-05-07T20:32:19.1079359Z if compiled: 
2025-05-07T20:32:19.1079698Z op = torch.compile(op) 2025-05-07T20:32:19.1080100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1080491Z 2025-05-07T20:32:19.1080760Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1080983Z 2025-05-07T20:32:19.1081124Z moe/activation_test.py:117: 2025-05-07T20:32:19.1081522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1081982Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1082367Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1083254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1084131Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1084816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1085732Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1086563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1087251Z kernel = self.compile( 2025-05-07T20:32:19.1087932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1088753Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1089341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1089638Z 2025-05-07T20:32:19.1089891Z self = 2025-05-07T20:32:19.1091250Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1093002Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60f61f80>} 2025-05-07T20:32:19.1094925Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1096253Z context = 2025-05-07T20:32:19.1096620Z 2025-05-07T20:32:19.1096838Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1097508Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1098105Z module_map=module_map) 2025-05-07T20:32:19.1098865Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1099326Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1099663Z E ^ 2025-05-07T20:32:19.1100280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1100940Z 2025-05-07T20:32:19.1101499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1102178Z
[Hypothesis retried eight further examples. Each retry re-printed the identical test source listed above and failed with the same CompilationError; only the distinguishing parameters and failure path are kept below.]
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> failed in ref_fn() (moe/activation_test.py:126) via triton_quantize_fp8_row -> _kernel_quantize_fp8_row: same CompilationError.
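The ValueError above is the root cause for every remaining retry: the Triton backend on this runner's GPU only offers the 'fp8e4b15' and 'fp8e5' fp8 dtypes, while both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant request fp8e4nv. A guard along the following lines could skip the fp8 path on such devices. This is a minimal sketch; the (8, 9) compute-capability threshold for fp8e4nv support and the class name are assumptions, not something stated in this log.

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv (e4m3) type needs compute capability >= (8, 9).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
class Fp8ActivationTests(unittest.TestCase):  # hypothetical container, not the real ActivationTests
    pass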
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> failed in fn() (moe/activation_test.py:117) via silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant: same CompilationError.
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> failed in ref_fn() (moe/activation_test.py:126) via triton_quantize_fp8_row -> _kernel_quantize_fp8_row: same CompilationError.
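Before the CompilationError took over, the first three sub-failures in the exception group were CUDA OOMs: small 20-40 MiB allocations failed because earlier generated examples had already filled the 22.07 GiB device. The OOM message itself suggests expandable segments; a sketch of that, plus explicit cache release between examples, with an illustrative helper name:

import os

# Must be set before the first CUDA allocation, ideally before importing torch;
# this mirrors the PYTORCH_CUDA_ALLOC_CONF suggestion printed in the OOM message above.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cuda_cache() -> None:
    # Hypothetical helper: return cached allocator blocks to the driver between
    # Hypothesis examples so one example's tensors cannot starve the next.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()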
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> failed in fn() (moe/activation_test.py:117) via silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant: same CompilationError.
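For reference, ref_fn in the source listing above computes a SiLU-gated product, y = x0 * sigmoid(x0) * x1, and then row-wise fp8 quantization via triton_quantize_fp8_row. Below is a plain-PyTorch sketch of that reference math, assuming a build with torch.float8_e4m3fn; the scale convention (per-row absolute max over the fp8 maximum, optionally clamped by scale_ub) is an assumed reading of triton_quantize_fp8_row's semantics, written out only to make the test's intent concrete.

from typing import Optional, Tuple
import torch

def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, as in the test's ref_fn.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Assumed row-wise convention: one scale per row such that
    # y ~= y_fp8.float() * scale[:, None], matching how the test dequantizes.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = (row_max / fp8_max).clamp(min=1e-12)
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale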
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> failed in fn() (moe/activation_test.py:117) via silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant: same CompilationError.
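Note that the ref_fn-side tracebacks all pass through Triton's autotuner (autotuner.py:186 run -> :166 _bench -> testing.py:117 do_bench) before reaching the compiler, which is why a dtype-support problem surfaces from inside a benchmarking loop: each candidate config is compiled lazily the first time it is timed. A simplified model of that flow, not Triton's actual implementation:

import time
from typing import Callable, Dict, Sequence, Tuple

def autotune_pick_fastest(
    kernel: Callable[..., None], configs: Sequence[Dict[str, int]], *args: object
) -> Dict[str, int]:
    # Bench every candidate config and keep the fastest, as Autotuner.run does;
    # the first call per config JIT-compiles, so compile errors surface mid-benchmark.
    timings: Dict[Tuple[Tuple[str, int], ...], float] = {}
    for config in configs:
        start = time.perf_counter()
        kernel(*args, **config)
        timings[tuple(sorted(config.items()))] = time.perf_counter() - start
    return dict(min(timings, key=timings.get))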
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> failed in ref_fn() (moe/activation_test.py:126) via triton_quantize_fp8_row -> _kernel_quantize_fp8_row: same CompilationError.
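Each falsifying example in the exception group above also came with a replay recipe: Hypothesis serializes the choices behind the example into a blob, and temporarily decorating the test with that blob deterministically replays exactly that example. The blob below is the one printed for sub-failure 4; it only decodes against test_silu_mul_quant's existing @given strategies, so it is shown as a patch sketch, and the decorator should be removed once the bug is fixed.

from hypothesis import reproduce_failure

# Copied verbatim from the log; place directly above the existing @given block
# in moe/activation_test.py:
#
#     @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')
#     @given(
#         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
#         ...
#     )
#     def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled): ...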
2025-05-07T20:32:19.1254266Z 
2025-05-07T20:32:19.1254377Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:19.1254595Z     self=<...>,
2025-05-07T20:32:19.1254681Z     T=128,
2025-05-07T20:32:19.1254757Z     D=7168,
2025-05-07T20:32:19.1254938Z     scale_ub=None,
2025-05-07T20:32:19.1255028Z     contiguous=False,
2025-05-07T20:32:19.1255110Z     compiled=False,
2025-05-07T20:32:19.1255179Z )
2025-05-07T20:32:19.1255399Z self = <...>
2025-05-07T20:32:19.1255568Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:19.1255572Z 
2025-05-07T20:32:19.1255647Z     @given(
2025-05-07T20:32:19.1255769Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:19.1255866Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:19.1255984Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:19.1256101Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:19.1256212Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:19.1256293Z     )
2025-05-07T20:32:19.1256533Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:19.1256631Z     def test_silu_mul_quant(
2025-05-07T20:32:19.1256711Z         self,
2025-05-07T20:32:19.1256788Z         T: int,
2025-05-07T20:32:19.1256864Z         D: int,
2025-05-07T20:32:19.1256966Z         scale_ub: Optional[float],
2025-05-07T20:32:19.1257053Z         contiguous: bool,
2025-05-07T20:32:19.1257135Z         compiled: bool,
2025-05-07T20:32:19.1257216Z     ) -> None:
2025-05-07T20:32:19.1257308Z         torch.manual_seed(2025)
2025-05-07T20:32:19.1257386Z 
2025-05-07T20:32:19.1257551Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:19.1257625Z 
2025-05-07T20:32:19.1257722Z         x_sign = torch.sign(x)
2025-05-07T20:32:19.1257844Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:19.1257935Z         x = x_sign * x_clamp
2025-05-07T20:32:19.1258020Z         x0 = x[:, :D]
2025-05-07T20:32:19.1258098Z         x1 = x[:, D:]
2025-05-07T20:32:19.1258186Z 
2025-05-07T20:32:19.1258265Z         if contiguous:
2025-05-07T20:32:19.1258365Z             x0 = x0.contiguous()
2025-05-07T20:32:19.1258451Z             x1 = x1.contiguous()
2025-05-07T20:32:19.1258524Z 
2025-05-07T20:32:19.1258619Z         if scale_ub is not None:
2025-05-07T20:32:19.1258719Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:19.1258855Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:19.1258940Z             )
2025-05-07T20:32:19.1259016Z         else:
2025-05-07T20:32:19.1259107Z             scale_ub_tensor = None
2025-05-07T20:32:19.1259189Z 
2025-05-07T20:32:19.1259317Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:19.1259410Z             op = silu_mul_quant
2025-05-07T20:32:19.1259545Z             if compiled:
2025-05-07T20:32:19.1259642Z                 op = torch.compile(op)
2025-05-07T20:32:19.1259755Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:19.1259826Z 
2025-05-07T20:32:19.1259915Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:19.1259920Z 
2025-05-07T20:32:19.1260027Z moe/activation_test.py:117: 
2025-05-07T20:32:19.1260153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:19.1260252Z moe/activation_test.py:115: in fn
2025-05-07T20:32:19.1260399Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:19.1260884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:19.1260983Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:19.1261335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:19.1261551Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:19.1261894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:19.1261985Z     kernel = self.compile(
2025-05-07T20:32:19.1262462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:19.1262636Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:19.1262762Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:19.1262770Z 
2025-05-07T20:32:19.1262978Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:19.1263734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:19.1264241Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fce5b278680>}
2025-05-07T20:32:19.1264974Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:19.1265161Z context = <...>
2025-05-07T20:32:19.1265169Z 
2025-05-07T20:32:19.1265334Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:19.1265589Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:19.1265697Z                            module_map=module_map)
2025-05-07T20:32:19.1265852Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:19.1265945Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:19.1266025Z E       ^
2025-05-07T20:32:19.1266369Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:19.1266374Z 
2025-05-07T20:32:19.1266783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
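The reference path failing here, triton_quantize_fp8_row, implements rowwise FP8 quantization: each row is divided by a per-row scale derived from its max-abs value (optionally capped by scale_ub), and that scale is returned so that y_fp8.to(torch.float32) * y_scale[:, None] reconstructs y, which is exactly how the test dequantizes. A minimal eager-mode sketch of that contract, assuming e4m3 as the target dtype; the eps floor and the way scale_ub caps the row maximum are assumptions, not FBGEMM's exact implementation:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs with a tiny floor so all-zero rows do not divide by zero.
        row_max = x.abs().amax(dim=-1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # assumed cap semantics
        y_scale = row_max / FP8_MAX  # per-row dequantization scale
        y_fp8 = (x / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale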
2025-05-07T20:32:19.1266794Z 
Hypothesis then tried nine more examples. Each re-ran the identical test body shown above, and every one failed with the same triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the drawn parameters and the first call to reach the Triton JIT differ (the fused kernel compiles with num_stages=3; the rowwise quantize kernel is reached through the autotuner and compiles with num_stages=2); the per-example source listings and tracebacks are verbatim repeats of the one above:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> failed at y_fp8, y_scale = fn() (moe/activation_test.py:117), compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> failed at y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126), compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> failed at ref_fn() (moe/activation_test.py:126), compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> failed at ref_fn() (moe/activation_test.py:126), compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> failed at ref_fn() (moe/activation_test.py:126), compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True) -> failed at ref_fn() (moe/activation_test.py:126), compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> failed at y_fp8, y_scale = fn() (moe/activation_test.py:117), entering through torch/_dynamo/eval_frame.py:678, compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> failed at ref_fn() (moe/activation_test.py:126), compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> failed at y_fp8, y_scale = fn() (moe/activation_test.py:117), compiling _fbgemm_silu_mul_quant
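Note that T, D, contiguity, and torch.compile make no difference above: each Hypothesis example re-enters the Triton JIT, and compilation fails before any kernel runs, in the fused op for some draws (silu_mul_quant fuses SiLU(x0) * x1 with the rowwise quantization) and in the reference quantizer for others. An unfused eager-mode equivalent of what the test computes, reusing quantize_fp8_row_ref from the sketch above; names here are illustrative, not the library's API:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then rowwise FP8 quantization, mirroring ref_fn.
        x0_fp32 = x0.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)
        return quantize_fp8_row_ref(y, scale_ub)  # sketch defined earlier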
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1404508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1404740Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1405076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1405177Z kernel = self.compile( 2025-05-07T20:32:19.1405559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1405731Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1405869Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1405873Z 2025-05-07T20:32:19.1406079Z self = 2025-05-07T20:32:19.1406928Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1407433Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60114400>} 2025-05-07T20:32:19.1408164Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1408403Z context = 2025-05-07T20:32:19.1408408Z 2025-05-07T20:32:19.1408571Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1408835Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1408947Z module_map=module_map) 2025-05-07T20:32:19.1409110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1409216Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1409294Z E ^ 2025-05-07T20:32:19.1409732Z E ValueError("type fp8e4nv not supported in this architecture. 
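The op under test fuses exactly these two steps. Mirroring the test's ref_fn, an eager equivalent of silu_mul_quant would look like the following sketch (names are illustrative; quantize_fp8_row_sketch is the rowwise helper sketched earlier, standing in for fbgemm_gpu's triton_quantize_fp8_row):

    import torch

    def silu_mul_quant_sketch(x0, x1, scale_ub):
        """Eager equivalent of the fused kernel: SiLU(x0) * x1, then rowwise FP8."""
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32  # SiLU(x0) * x1
        # Rowwise quantization as in the earlier sketch (assumption: same
        # semantics as triton_quantize_fp8_row).
        return quantize_fp8_row_sketch(y, scale_ub)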
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1409737Z 2025-05-07T20:32:19.1410143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1410150Z 2025-05-07T20:32:19.1410255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1410481Z self=, 2025-05-07T20:32:19.1410561Z T=128, 2025-05-07T20:32:19.1410639Z D=5120, 2025-05-07T20:32:19.1410730Z scale_ub=None, 2025-05-07T20:32:19.1410818Z contiguous=False, 2025-05-07T20:32:19.1410912Z compiled=True, 2025-05-07T20:32:19.1410987Z ) 2025-05-07T20:32:19.1411203Z self = 2025-05-07T20:32:19.1411376Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.1411380Z 2025-05-07T20:32:19.1411464Z @given( 2025-05-07T20:32:19.1411584Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1411693Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1411809Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1411930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1412050Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1412125Z ) 2025-05-07T20:32:19.1412372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1412468Z def test_silu_mul_quant( 2025-05-07T20:32:19.1412544Z self, 2025-05-07T20:32:19.1412630Z T: int, 2025-05-07T20:32:19.1412707Z D: int, 2025-05-07T20:32:19.1412804Z scale_ub: Optional[float], 2025-05-07T20:32:19.1412900Z contiguous: bool, 2025-05-07T20:32:19.1412987Z compiled: bool, 2025-05-07T20:32:19.1413067Z ) -> None: 2025-05-07T20:32:19.1413178Z torch.manual_seed(2025) 2025-05-07T20:32:19.1413251Z 2025-05-07T20:32:19.1413418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1413497Z 2025-05-07T20:32:19.1413589Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1413811Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1413902Z x = x_sign * x_clamp 2025-05-07T20:32:19.1413984Z x0 = x[:, :D] 2025-05-07T20:32:19.1414072Z x1 = x[:, D:] 2025-05-07T20:32:19.1414145Z 2025-05-07T20:32:19.1414229Z if contiguous: 2025-05-07T20:32:19.1414326Z x0 = x0.contiguous() 2025-05-07T20:32:19.1414416Z x1 = x1.contiguous() 2025-05-07T20:32:19.1414539Z 2025-05-07T20:32:19.1414633Z if scale_ub is not None: 2025-05-07T20:32:19.1414738Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1414872Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1414953Z ) 2025-05-07T20:32:19.1415034Z else: 2025-05-07T20:32:19.1415128Z scale_ub_tensor = None 2025-05-07T20:32:19.1415206Z 2025-05-07T20:32:19.1415335Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1415496Z op = silu_mul_quant 2025-05-07T20:32:19.1415581Z if compiled: 2025-05-07T20:32:19.1415680Z op = torch.compile(op) 2025-05-07T20:32:19.1415790Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1415863Z 2025-05-07T20:32:19.1415953Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1415958Z 2025-05-07T20:32:19.1416062Z moe/activation_test.py:117: 2025-05-07T20:32:19.1416189Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1416294Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1416399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1416835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1416936Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1417423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1417524Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1417885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1418104Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1418447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1418543Z kernel = self.compile( 2025-05-07T20:32:19.1418920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1419097Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1419231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1419236Z 2025-05-07T20:32:19.1419440Z self = 2025-05-07T20:32:19.1420213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1420713Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5d91ee0>} 2025-05-07T20:32:19.1421453Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1421647Z context = 2025-05-07T20:32:19.1421652Z 2025-05-07T20:32:19.1421821Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1422079Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1422190Z module_map=module_map) 2025-05-07T20:32:19.1422357Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1422457Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1422537Z E ^ 2025-05-07T20:32:19.1422893Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1422975Z 2025-05-07T20:32:19.1423378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1423383Z 2025-05-07T20:32:19.1423493Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1423719Z self=, 2025-05-07T20:32:19.1423798Z T=128, 2025-05-07T20:32:19.1423884Z D=7168, 2025-05-07T20:32:19.1423967Z scale_ub=1200.0, 2025-05-07T20:32:19.1424055Z contiguous=False, 2025-05-07T20:32:19.1424188Z compiled=False, 2025-05-07T20:32:19.1424262Z ) 2025-05-07T20:32:19.1424491Z self = 2025-05-07T20:32:19.1424662Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:19.1424667Z 2025-05-07T20:32:19.1424743Z @given( 2025-05-07T20:32:19.1424867Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1424971Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1425088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1425212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1425324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1425400Z ) 2025-05-07T20:32:19.1425725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1425820Z def test_silu_mul_quant( 2025-05-07T20:32:19.1425903Z self, 2025-05-07T20:32:19.1425990Z T: int, 2025-05-07T20:32:19.1426068Z D: int, 2025-05-07T20:32:19.1426176Z scale_ub: Optional[float], 2025-05-07T20:32:19.1426266Z contiguous: bool, 2025-05-07T20:32:19.1426352Z compiled: bool, 2025-05-07T20:32:19.1426437Z ) -> None: 2025-05-07T20:32:19.1426532Z torch.manual_seed(2025) 2025-05-07T20:32:19.1426605Z 2025-05-07T20:32:19.1426775Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1426853Z 2025-05-07T20:32:19.1426949Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1427078Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1427168Z x = x_sign * x_clamp 2025-05-07T20:32:19.1427255Z x0 = x[:, :D] 2025-05-07T20:32:19.1427340Z x1 = x[:, D:] 2025-05-07T20:32:19.1427414Z 2025-05-07T20:32:19.1427503Z if contiguous: 2025-05-07T20:32:19.1427593Z x0 = x0.contiguous() 2025-05-07T20:32:19.1427684Z x1 = x1.contiguous() 2025-05-07T20:32:19.1427765Z 2025-05-07T20:32:19.1427855Z if scale_ub is not None: 2025-05-07T20:32:19.1427963Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1428102Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1428178Z ) 2025-05-07T20:32:19.1428255Z else: 2025-05-07T20:32:19.1428353Z scale_ub_tensor = None 2025-05-07T20:32:19.1428426Z 2025-05-07T20:32:19.1428559Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1428656Z op = silu_mul_quant 2025-05-07T20:32:19.1428742Z if compiled: 2025-05-07T20:32:19.1428848Z op = torch.compile(op) 2025-05-07T20:32:19.1428956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1429034Z 2025-05-07T20:32:19.1429131Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1429136Z 2025-05-07T20:32:19.1429233Z moe/activation_test.py:117: 2025-05-07T20:32:19.1429361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1429471Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1429569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1430067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1430164Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1430519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1430794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1431134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1431228Z kernel = self.compile( 2025-05-07T20:32:19.1431614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1431831Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1431963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1431967Z 2025-05-07T20:32:19.1432170Z self = 2025-05-07T20:32:19.1432933Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1433560Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5b767560>} 2025-05-07T20:32:19.1434293Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1434491Z context = 2025-05-07T20:32:19.1434496Z 2025-05-07T20:32:19.1434660Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1434920Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1435037Z module_map=module_map) 2025-05-07T20:32:19.1435199Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1435306Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1435388Z E ^ 2025-05-07T20:32:19.1435741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1435746Z 2025-05-07T20:32:19.1436165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1436169Z 2025-05-07T20:32:19.1436277Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1436506Z self=, 2025-05-07T20:32:19.1436582Z T=128, 2025-05-07T20:32:19.1436660Z D=5120, 2025-05-07T20:32:19.1436751Z scale_ub=None, 2025-05-07T20:32:19.1436839Z contiguous=False, 2025-05-07T20:32:19.1436926Z compiled=False, 2025-05-07T20:32:19.1437008Z ) 2025-05-07T20:32:19.1437228Z self = 2025-05-07T20:32:19.1437400Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:19.1437405Z 2025-05-07T20:32:19.1437493Z @given( 2025-05-07T20:32:19.1437612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1437726Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1437842Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1437959Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1438082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1438159Z ) 2025-05-07T20:32:19.1438400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1438502Z def test_silu_mul_quant( 2025-05-07T20:32:19.1438580Z self, 2025-05-07T20:32:19.1438658Z T: int, 2025-05-07T20:32:19.1438739Z D: int, 2025-05-07T20:32:19.1438838Z scale_ub: Optional[float], 2025-05-07T20:32:19.1438978Z contiguous: bool, 2025-05-07T20:32:19.1439071Z compiled: bool, 2025-05-07T20:32:19.1439151Z ) -> None: 2025-05-07T20:32:19.1439251Z torch.manual_seed(2025) 2025-05-07T20:32:19.1439324Z 2025-05-07T20:32:19.1439496Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1439576Z 2025-05-07T20:32:19.1439668Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1439793Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1439933Z x = x_sign * x_clamp 2025-05-07T20:32:19.1440014Z x0 = x[:, :D] 2025-05-07T20:32:19.1440094Z x1 = x[:, D:] 2025-05-07T20:32:19.1440173Z 2025-05-07T20:32:19.1440256Z if contiguous: 2025-05-07T20:32:19.1440346Z x0 = x0.contiguous() 2025-05-07T20:32:19.1440443Z x1 = x1.contiguous() 2025-05-07T20:32:19.1440515Z 2025-05-07T20:32:19.1440613Z if scale_ub is not None: 2025-05-07T20:32:19.1440721Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1440853Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1440935Z ) 2025-05-07T20:32:19.1441012Z else: 2025-05-07T20:32:19.1441106Z scale_ub_tensor = None 2025-05-07T20:32:19.1441189Z 2025-05-07T20:32:19.1441392Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1441484Z op = silu_mul_quant 2025-05-07T20:32:19.1441580Z if compiled: 2025-05-07T20:32:19.1441681Z op = torch.compile(op) 2025-05-07T20:32:19.1441789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1441868Z 2025-05-07T20:32:19.1441960Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1441965Z 2025-05-07T20:32:19.1442068Z moe/activation_test.py:117: 2025-05-07T20:32:19.1442196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1442296Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1442408Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1442898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1442996Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1443363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1443578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1443924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1444017Z kernel = self.compile( 2025-05-07T20:32:19.1444397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1444573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1444699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1444705Z 2025-05-07T20:32:19.1444913Z self = 2025-05-07T20:32:19.1445679Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1446177Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a892a20>} 2025-05-07T20:32:19.1446918Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1447109Z context = 2025-05-07T20:32:19.1447184Z 2025-05-07T20:32:19.1447357Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1447616Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1447729Z module_map=module_map) 2025-05-07T20:32:19.1447897Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1447996Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1448074Z E ^ 2025-05-07T20:32:19.1448471Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1448476Z 2025-05-07T20:32:19.1448881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1448885Z 2025-05-07T20:32:19.1448997Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1449216Z self=, 2025-05-07T20:32:19.1449297Z T=128, 2025-05-07T20:32:19.1449380Z D=5120, 2025-05-07T20:32:19.1449465Z scale_ub=1200.0, 2025-05-07T20:32:19.1449557Z contiguous=True, 2025-05-07T20:32:19.1449655Z compiled=False, 2025-05-07T20:32:19.1449730Z ) 2025-05-07T20:32:19.1450038Z self = 2025-05-07T20:32:19.1450209Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:19.1450214Z 2025-05-07T20:32:19.1450294Z @given( 2025-05-07T20:32:19.1450424Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1450527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1450640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1450761Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1456942Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1457049Z ) 2025-05-07T20:32:19.1457310Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1457417Z def test_silu_mul_quant( 2025-05-07T20:32:19.1457497Z self, 2025-05-07T20:32:19.1457575Z T: int, 2025-05-07T20:32:19.1457662Z D: int, 2025-05-07T20:32:19.1457769Z scale_ub: Optional[float], 2025-05-07T20:32:19.1457864Z contiguous: bool, 2025-05-07T20:32:19.1457959Z compiled: bool, 2025-05-07T20:32:19.1458040Z ) -> None: 2025-05-07T20:32:19.1458138Z torch.manual_seed(2025) 2025-05-07T20:32:19.1458226Z 2025-05-07T20:32:19.1458398Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1458474Z 2025-05-07T20:32:19.1458587Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1458713Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1458812Z x = x_sign * x_clamp 2025-05-07T20:32:19.1458898Z x0 = x[:, :D] 2025-05-07T20:32:19.1458983Z x1 = x[:, D:] 2025-05-07T20:32:19.1459068Z 2025-05-07T20:32:19.1459154Z if contiguous: 2025-05-07T20:32:19.1459250Z x0 = x0.contiguous() 2025-05-07T20:32:19.1459351Z x1 = x1.contiguous() 2025-05-07T20:32:19.1459427Z 2025-05-07T20:32:19.1459521Z if scale_ub is not None: 2025-05-07T20:32:19.1459645Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1459784Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1459863Z ) 2025-05-07T20:32:19.1459951Z else: 2025-05-07T20:32:19.1460050Z scale_ub_tensor = None 2025-05-07T20:32:19.1460126Z 2025-05-07T20:32:19.1460269Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1460362Z op = silu_mul_quant 2025-05-07T20:32:19.1460457Z if compiled: 2025-05-07T20:32:19.1460559Z op = torch.compile(op) 2025-05-07T20:32:19.1460667Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1460824Z 2025-05-07T20:32:19.1460919Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1460924Z 2025-05-07T20:32:19.1461025Z moe/activation_test.py:117: 2025-05-07T20:32:19.1461167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1461275Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1461378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1461883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1462033Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1462398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1462622Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1462962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1463069Z kernel = self.compile( 2025-05-07T20:32:19.1463451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1463633Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1463845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1463850Z 2025-05-07T20:32:19.1464056Z self = 2025-05-07T20:32:19.1464838Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1465338Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a892c00>} 2025-05-07T20:32:19.1466088Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1466288Z context = 2025-05-07T20:32:19.1466292Z 2025-05-07T20:32:19.1466458Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1466726Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1466841Z module_map=module_map) 2025-05-07T20:32:19.1467013Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1467116Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1467195Z E ^ 2025-05-07T20:32:19.1467557Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1467564Z 2025-05-07T20:32:19.1467975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1467980Z 2025-05-07T20:32:19.1468097Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1468324Z self=, 2025-05-07T20:32:19.1468404Z T=1, 2025-05-07T20:32:19.1468495Z D=7168, 2025-05-07T20:32:19.1468582Z scale_ub=1200.0, 2025-05-07T20:32:19.1468672Z contiguous=True, 2025-05-07T20:32:19.1468769Z compiled=True, 2025-05-07T20:32:19.1468848Z ) 2025-05-07T20:32:19.1469069Z self = 2025-05-07T20:32:19.1469247Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:19.1469252Z 2025-05-07T20:32:19.1469332Z @given( 2025-05-07T20:32:19.1469464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1469617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1469737Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1469861Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1469973Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1470060Z ) 2025-05-07T20:32:19.1470311Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1470407Z def test_silu_mul_quant( 2025-05-07T20:32:19.1470490Z self, 2025-05-07T20:32:19.1470613Z T: int, 2025-05-07T20:32:19.1470695Z D: int, 2025-05-07T20:32:19.1470802Z scale_ub: Optional[float], 2025-05-07T20:32:19.1470893Z contiguous: bool, 2025-05-07T20:32:19.1470981Z compiled: bool, 2025-05-07T20:32:19.1471069Z ) -> None: 2025-05-07T20:32:19.1471165Z torch.manual_seed(2025) 2025-05-07T20:32:19.1471243Z 2025-05-07T20:32:19.1471419Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1471500Z 2025-05-07T20:32:19.1471596Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1471732Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1471822Z x = x_sign * x_clamp 2025-05-07T20:32:19.1471910Z x0 = x[:, :D] 2025-05-07T20:32:19.1472070Z x1 = x[:, D:] 2025-05-07T20:32:19.1472146Z 2025-05-07T20:32:19.1472238Z if contiguous: 2025-05-07T20:32:19.1472332Z x0 = x0.contiguous() 2025-05-07T20:32:19.1472424Z x1 = x1.contiguous() 2025-05-07T20:32:19.1472510Z 2025-05-07T20:32:19.1472601Z if scale_ub is not None: 2025-05-07T20:32:19.1472710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1472855Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1472933Z ) 2025-05-07T20:32:19.1473010Z else: 2025-05-07T20:32:19.1473113Z scale_ub_tensor = None 2025-05-07T20:32:19.1473187Z 2025-05-07T20:32:19.1473326Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1473419Z op = silu_mul_quant 2025-05-07T20:32:19.1473505Z if compiled: 2025-05-07T20:32:19.1473610Z op = torch.compile(op) 2025-05-07T20:32:19.1473720Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1473794Z 2025-05-07T20:32:19.1473891Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1473896Z 2025-05-07T20:32:19.1473996Z moe/activation_test.py:117: 2025-05-07T20:32:19.1474125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1474235Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1474337Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1474707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1474801Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1475291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1475398Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1475753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1475982Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1476325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1476424Z kernel = self.compile( 2025-05-07T20:32:19.1476810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1476984Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1477113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1477118Z 2025-05-07T20:32:19.1477329Z self = 2025-05-07T20:32:19.1478147Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1478655Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a215080>} 2025-05-07T20:32:19.1479431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1479623Z context = 2025-05-07T20:32:19.1479634Z 2025-05-07T20:32:19.1479800Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1480062Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1480178Z module_map=module_map) 2025-05-07T20:32:19.1480343Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1480541Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1480627Z E ^ 2025-05-07T20:32:19.1480979Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1480990Z 2025-05-07T20:32:19.1481403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1481408Z 2025-05-07T20:32:19.1481512Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1481734Z self=, 2025-05-07T20:32:19.1481822Z T=1, 2025-05-07T20:32:19.1481899Z D=7168, 2025-05-07T20:32:19.1481988Z scale_ub=1200.0, 2025-05-07T20:32:19.1482081Z contiguous=False, 2025-05-07T20:32:19.1482165Z compiled=True, 2025-05-07T20:32:19.1482242Z ) 2025-05-07T20:32:19.1482464Z self = 2025-05-07T20:32:19.1482633Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:19.1482638Z 2025-05-07T20:32:19.1482723Z @given( 2025-05-07T20:32:19.1482846Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1482948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1483074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1483192Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1483307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1483390Z ) 2025-05-07T20:32:19.1483634Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1483736Z def test_silu_mul_quant( 2025-05-07T20:32:19.1483817Z self, 2025-05-07T20:32:19.1483895Z T: int, 2025-05-07T20:32:19.1483977Z D: int, 2025-05-07T20:32:19.1484077Z scale_ub: Optional[float], 2025-05-07T20:32:19.1484168Z contiguous: bool, 2025-05-07T20:32:19.1484262Z compiled: bool, 2025-05-07T20:32:19.1484346Z ) -> None: 2025-05-07T20:32:19.1484442Z torch.manual_seed(2025) 2025-05-07T20:32:19.1484524Z 2025-05-07T20:32:19.1484692Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1484771Z 2025-05-07T20:32:19.1484872Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1485001Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1485092Z x = x_sign * x_clamp 2025-05-07T20:32:19.1485182Z x0 = x[:, :D] 2025-05-07T20:32:19.1485264Z x1 = x[:, D:] 2025-05-07T20:32:19.1485345Z 2025-05-07T20:32:19.1485431Z if contiguous: 2025-05-07T20:32:19.1485523Z x0 = x0.contiguous() 2025-05-07T20:32:19.1485668Z x1 = x1.contiguous() 2025-05-07T20:32:19.1485743Z 2025-05-07T20:32:19.1485835Z if scale_ub is not None: 2025-05-07T20:32:19.1485949Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1486089Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1486167Z ) 2025-05-07T20:32:19.1486251Z else: 2025-05-07T20:32:19.1486347Z scale_ub_tensor = None 2025-05-07T20:32:19.1486421Z 2025-05-07T20:32:19.1486603Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1486695Z op = silu_mul_quant 2025-05-07T20:32:19.1486787Z if compiled: 2025-05-07T20:32:19.1486888Z op = torch.compile(op) 2025-05-07T20:32:19.1486995Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1487076Z 2025-05-07T20:32:19.1487170Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1487175Z 2025-05-07T20:32:19.1487271Z moe/activation_test.py:117: 2025-05-07T20:32:19.1487414Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1487515Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1487616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1488067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1488166Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1488668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1488773Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1489131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1489362Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1489699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1489796Z kernel = self.compile( 2025-05-07T20:32:19.1490182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1490358Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1490494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1490498Z 2025-05-07T20:32:19.1490702Z self = 2025-05-07T20:32:19.1491465Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1491975Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a217380>} 2025-05-07T20:32:19.1494212Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1494419Z context = 2025-05-07T20:32:19.1494424Z 2025-05-07T20:32:19.1494591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1494858Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1494968Z module_map=module_map) 2025-05-07T20:32:19.1495132Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1495239Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1495318Z E ^ 2025-05-07T20:32:19.1495671Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1495724Z 2025-05-07T20:32:19.1496139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1496143Z 2025-05-07T20:32:19.1496253Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1496481Z self=, 2025-05-07T20:32:19.1496562Z T=1, 2025-05-07T20:32:19.1496641Z D=7168, 2025-05-07T20:32:19.1496733Z scale_ub=None, 2025-05-07T20:32:19.1496863Z contiguous=False, 2025-05-07T20:32:19.1496949Z compiled=True, 2025-05-07T20:32:19.1497030Z ) 2025-05-07T20:32:19.1497249Z self = 2025-05-07T20:32:19.1497414Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.1497426Z 2025-05-07T20:32:19.1497504Z @given( 2025-05-07T20:32:19.1497625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1497737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1497854Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1497973Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1498172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1498552Z ) 2025-05-07T20:32:19.1498861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1498966Z def test_silu_mul_quant( 2025-05-07T20:32:19.1499050Z self, 2025-05-07T20:32:19.1499137Z T: int, 2025-05-07T20:32:19.1499216Z D: int, 2025-05-07T20:32:19.1499316Z scale_ub: Optional[float], 2025-05-07T20:32:19.1499415Z contiguous: bool, 2025-05-07T20:32:19.1499505Z compiled: bool, 2025-05-07T20:32:19.1499585Z ) -> None: 2025-05-07T20:32:19.1499688Z torch.manual_seed(2025) 2025-05-07T20:32:19.1499763Z 2025-05-07T20:32:19.1499935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1500016Z 2025-05-07T20:32:19.1500109Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1500236Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1500331Z x = x_sign * x_clamp 2025-05-07T20:32:19.1500421Z x0 = x[:, :D] 2025-05-07T20:32:19.1500503Z x1 = x[:, D:] 2025-05-07T20:32:19.1500582Z 2025-05-07T20:32:19.1500669Z if contiguous: 2025-05-07T20:32:19.1500768Z x0 = x0.contiguous() 2025-05-07T20:32:19.1500861Z x1 = x1.contiguous() 2025-05-07T20:32:19.1500935Z 2025-05-07T20:32:19.1501038Z if scale_ub is not None: 2025-05-07T20:32:19.1501146Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1501281Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1501365Z ) 2025-05-07T20:32:19.1501443Z else: 2025-05-07T20:32:19.1501541Z scale_ub_tensor = None 2025-05-07T20:32:19.1501631Z 2025-05-07T20:32:19.1501765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1501856Z op = silu_mul_quant 2025-05-07T20:32:19.1501949Z if compiled: 2025-05-07T20:32:19.1502051Z op = torch.compile(op) 2025-05-07T20:32:19.1502171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1502246Z 2025-05-07T20:32:19.1502341Z y_fp8, y_scale = fn() 2025-05-07T20:32:19.1502470Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:19.1502547Z 2025-05-07T20:32:19.1502684Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1502797Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:19.1502899Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:19.1503022Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:19.1503170Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.1503400Z 2025-05-07T20:32:19.1503502Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:19.1503514Z 2025-05-07T20:32:19.1503613Z moe/activation_test.py:126: 2025-05-07T20:32:19.1503746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1503871Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:19.1504006Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.1504556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:19.1504778Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:19.1505136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1505366Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1505729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:19.1505988Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:19.1506482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:19.1506653Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:19.1506990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:19.1507081Z fn() 2025-05-07T20:32:19.1507480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:19.1507569Z self.fn.run( 2025-05-07T20:32:19.1507905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1507999Z kernel = self.compile( 2025-05-07T20:32:19.1508382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1508560Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1508692Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1508708Z 2025-05-07T20:32:19.1508913Z self = 2025-05-07T20:32:19.1509679Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1510188Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5b1cc220>} 2025-05-07T20:32:19.1510919Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1511120Z context = 2025-05-07T20:32:19.1511124Z 2025-05-07T20:32:19.1511294Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1511554Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1511671Z module_map=module_map) 2025-05-07T20:32:19.1511837Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1511945Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:19.1512025Z E ^ 2025-05-07T20:32:19.1512376Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1512381Z 2025-05-07T20:32:19.1512794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1512845Z 2025-05-07T20:32:19.1512950Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1513173Z self=, 2025-05-07T20:32:19.1513259Z T=1, 2025-05-07T20:32:19.1513343Z D=5120, 2025-05-07T20:32:19.1513439Z scale_ub=1200.0, 2025-05-07T20:32:19.1513528Z contiguous=False, 2025-05-07T20:32:19.1513613Z compiled=True, 2025-05-07T20:32:19.1513693Z ) 2025-05-07T20:32:19.1513955Z self = 2025-05-07T20:32:19.1514119Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:19.1514124Z 2025-05-07T20:32:19.1514209Z @given( 2025-05-07T20:32:19.1514332Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1514432Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1514555Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1514679Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1514799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1514877Z ) 2025-05-07T20:32:19.1515222Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1515327Z def test_silu_mul_quant( 2025-05-07T20:32:19.1515406Z self, 2025-05-07T20:32:19.1515487Z T: int, 2025-05-07T20:32:19.1515572Z D: int, 2025-05-07T20:32:19.1515675Z scale_ub: Optional[float], 2025-05-07T20:32:19.1515767Z contiguous: bool, 2025-05-07T20:32:19.1515860Z compiled: bool, 2025-05-07T20:32:19.1515940Z ) -> None: 2025-05-07T20:32:19.1516037Z torch.manual_seed(2025) 2025-05-07T20:32:19.1516119Z 2025-05-07T20:32:19.1516287Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1516368Z 2025-05-07T20:32:19.1516462Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1516596Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1516696Z x = x_sign * x_clamp 2025-05-07T20:32:19.1516780Z x0 = x[:, :D] 2025-05-07T20:32:19.1516862Z x1 = x[:, D:] 2025-05-07T20:32:19.1516944Z 2025-05-07T20:32:19.1517034Z if contiguous: 2025-05-07T20:32:19.1517127Z x0 = x0.contiguous() 2025-05-07T20:32:19.1517229Z x1 = x1.contiguous() 2025-05-07T20:32:19.1517311Z 2025-05-07T20:32:19.1517404Z if scale_ub is not None: 2025-05-07T20:32:19.1517513Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1517658Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1517737Z ) 2025-05-07T20:32:19.1517814Z else: 2025-05-07T20:32:19.1517918Z scale_ub_tensor = None 2025-05-07T20:32:19.1517995Z 2025-05-07T20:32:19.1518129Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1518227Z op = silu_mul_quant 2025-05-07T20:32:19.1518317Z if compiled: 2025-05-07T20:32:19.1518426Z op = torch.compile(op) 2025-05-07T20:32:19.1518533Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1518607Z 2025-05-07T20:32:19.1518707Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1518715Z 2025-05-07T20:32:19.1518814Z moe/activation_test.py:117: 2025-05-07T20:32:19.1518945Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1519057Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1519162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1519530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1519626Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1520115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1520270Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1520625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1520845Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1521194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1521290Z kernel = self.compile( 2025-05-07T20:32:19.1521675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1521889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1522018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1522023Z 2025-05-07T20:32:19.1522234Z self = 2025-05-07T20:32:19.1523000Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1523586Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a9efb00>} 2025-05-07T20:32:19.1524320Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1524516Z context = 2025-05-07T20:32:19.1524528Z 2025-05-07T20:32:19.1524693Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1524952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1525069Z module_map=module_map) 2025-05-07T20:32:19.1525230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1525330Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1525414Z E ^ 2025-05-07T20:32:19.1525767Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1525772Z 2025-05-07T20:32:19.1526185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1526192Z 2025-05-07T20:32:19.1526296Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1526517Z self=, 2025-05-07T20:32:19.1526601Z T=1, 2025-05-07T20:32:19.1526677Z D=5120, 2025-05-07T20:32:19.1526762Z scale_ub=1200.0, 2025-05-07T20:32:19.1526855Z contiguous=False, 2025-05-07T20:32:19.1526943Z compiled=False, 2025-05-07T20:32:19.1527017Z ) 2025-05-07T20:32:19.1527243Z self = 2025-05-07T20:32:19.1527408Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:19.1527413Z 2025-05-07T20:32:19.1527501Z @given( 2025-05-07T20:32:19.1527621Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1527721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1527844Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1527966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1528081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1528167Z ) 2025-05-07T20:32:19.1528409Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1528504Z def test_silu_mul_quant( 2025-05-07T20:32:19.1528589Z self, 2025-05-07T20:32:19.1528667Z T: int, 2025-05-07T20:32:19.1528800Z D: int, 2025-05-07T20:32:19.1528901Z scale_ub: Optional[float], 2025-05-07T20:32:19.1528991Z contiguous: bool, 2025-05-07T20:32:19.1529085Z compiled: bool, 2025-05-07T20:32:19.1529167Z ) -> None: 2025-05-07T20:32:19.1529268Z torch.manual_seed(2025) 2025-05-07T20:32:19.1529350Z 2025-05-07T20:32:19.1529518Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1529592Z 2025-05-07T20:32:19.1529691Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1529859Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1529949Z x = x_sign * x_clamp 2025-05-07T20:32:19.1530039Z x0 = x[:, :D] 2025-05-07T20:32:19.1530121Z x1 = x[:, D:] 2025-05-07T20:32:19.1530202Z 2025-05-07T20:32:19.1530287Z if contiguous: 2025-05-07T20:32:19.1530379Z x0 = x0.contiguous() 2025-05-07T20:32:19.1530476Z x1 = x1.contiguous() 2025-05-07T20:32:19.1530554Z 2025-05-07T20:32:19.1530647Z if scale_ub is not None: 2025-05-07T20:32:19.1530759Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1530893Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1530971Z ) 2025-05-07T20:32:19.1531127Z else: 2025-05-07T20:32:19.1531225Z scale_ub_tensor = None 2025-05-07T20:32:19.1531300Z 2025-05-07T20:32:19.1531436Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1531527Z op = silu_mul_quant 2025-05-07T20:32:19.1531617Z if compiled: 2025-05-07T20:32:19.1531724Z op = torch.compile(op) 2025-05-07T20:32:19.1531833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1531920Z 2025-05-07T20:32:19.1532013Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1532017Z 2025-05-07T20:32:19.1532115Z moe/activation_test.py:117: 2025-05-07T20:32:19.1532252Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1532357Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1532458Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1532961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1533061Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fce5a9f56c0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fce5a9f6fc0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
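The failure is architectural rather than data-dependent: fp8e4nv is Triton's name for the float8_e4m3fn format, and Triton's NVIDIA backend only accepts it on GPUs with compute capability (8, 9) or newer (Ada/Hopper). The error above indicates the GPU on this runner reports an older capability. A minimal sketch of an up-front check, assuming only public PyTorch APIs (the helper name is illustrative, not part of FBGEMM):

import torch

def supports_fp8e4nv() -> bool:
    # Triton's NVIDIA backend accepts fp8e4nv (torch.float8_e4m3fn) only on
    # compute capability (8, 9) or newer; older GPUs raise the ValueError
    # seen throughout this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)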
Hypothesis went on to draw further examples, and every one failed with the identical test body, traceback, and CompilationError; only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
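With a check like that, the fp8 path could be skipped up front on unsupported hardware instead of letting every drawn example die in the Triton compiler. A sketch of one way to wire the skip into a unittest-style test such as the one above (the class name and decorator placement are illustrative, not FBGEMM's actual test file):

import unittest

import torch

def _sm89_or_newer() -> bool:
    # Same capability check as above: fp8e4nv needs an Ada/Hopper-class GPU.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class ActivationTests(unittest.TestCase):  # illustrative stand-in
    # The skip sits outside the @given/@settings decorators, so unsupported
    # hardware skips once rather than failing every Hypothesis example.
    @unittest.skipUnless(_sm89_or_newer(), "fp8e4nv requires compute capability >= (8, 9)")
    def test_silu_mul_quant(self) -> None:
        ...  # the @given-driven body shown earlier

if __name__ == "__main__":
    unittest.main()

The log itself, meanwhile, keeps cycling through examples: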
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
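For orientation while reading these dumps: judging from the test's inputs and outputs (bfloat16 halves x0 and x1, an optional float32 scale_ub tensor, and a (y_fp8, y_scale) result pair), silu_mul_quant fuses SiLU(x0) * x1 with rowwise fp8 quantization. A plain-PyTorch sketch of those assumed semantics follows; it is not FBGEMM's actual kernel, and the fp8 maximum and clamping details are assumptions:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    # Activation in float32 for accuracy: SiLU(x0) * x1.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Rowwise dynamic range, optionally capped from above by scale_ub.
    row_max = y.abs().amax(dim=1, keepdim=True)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max.clamp(min=1e-12) / FP8_MAX
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(1)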
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1689294Z 2025-05-07T20:32:19.1689703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1689707Z 2025-05-07T20:32:19.1689819Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1690039Z self=, 2025-05-07T20:32:19.1690175Z T=1, 2025-05-07T20:32:19.1690253Z D=7168, 2025-05-07T20:32:19.1690335Z scale_ub=None, 2025-05-07T20:32:19.1690432Z contiguous=False, 2025-05-07T20:32:19.1690516Z compiled=False, 2025-05-07T20:32:19.1690588Z ) 2025-05-07T20:32:19.1690818Z self = 2025-05-07T20:32:19.1690983Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:19.1690987Z 2025-05-07T20:32:19.1691109Z @given( 2025-05-07T20:32:19.1691240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1691338Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1691459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1691581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1691694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1691775Z ) 2025-05-07T20:32:19.1692021Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1692115Z def test_silu_mul_quant( 2025-05-07T20:32:19.1692203Z self, 2025-05-07T20:32:19.1692282Z T: int, 2025-05-07T20:32:19.1692358Z D: int, 2025-05-07T20:32:19.1692544Z scale_ub: Optional[float], 2025-05-07T20:32:19.1692639Z contiguous: bool, 2025-05-07T20:32:19.1692723Z compiled: bool, 2025-05-07T20:32:19.1692810Z ) -> None: 2025-05-07T20:32:19.1692906Z torch.manual_seed(2025) 2025-05-07T20:32:19.1692989Z 2025-05-07T20:32:19.1693157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1693231Z 2025-05-07T20:32:19.1693331Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1693456Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1693545Z x = x_sign * x_clamp 2025-05-07T20:32:19.1693633Z x0 = x[:, :D] 2025-05-07T20:32:19.1693857Z x1 = x[:, D:] 2025-05-07T20:32:19.1693934Z 2025-05-07T20:32:19.1694024Z if contiguous: 2025-05-07T20:32:19.1694119Z x0 = x0.contiguous() 2025-05-07T20:32:19.1694208Z x1 = x1.contiguous() 2025-05-07T20:32:19.1694289Z 2025-05-07T20:32:19.1694381Z if scale_ub is not None: 2025-05-07T20:32:19.1694496Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1694630Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1694708Z ) 2025-05-07T20:32:19.1694794Z else: 2025-05-07T20:32:19.1694887Z scale_ub_tensor = None 2025-05-07T20:32:19.1694960Z 2025-05-07T20:32:19.1695096Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1695185Z op = silu_mul_quant 2025-05-07T20:32:19.1695271Z if compiled: 2025-05-07T20:32:19.1695376Z op = torch.compile(op) 2025-05-07T20:32:19.1695481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1695557Z 2025-05-07T20:32:19.1695655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1695660Z 2025-05-07T20:32:19.1695756Z moe/activation_test.py:117: 2025-05-07T20:32:19.1695889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1695994Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1696093Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1696588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1696689Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1697043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1697267Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1697602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1697756Z kernel = self.compile( 2025-05-07T20:32:19.1698134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1699396Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1699745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1699752Z 2025-05-07T20:32:19.1699984Z self = 2025-05-07T20:32:19.1701932Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1702937Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda48e5760>} 2025-05-07T20:32:19.1704430Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1705072Z context = 2025-05-07T20:32:19.1705083Z 2025-05-07T20:32:19.1705413Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1705938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1706161Z module_map=module_map) 2025-05-07T20:32:19.1706485Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1706690Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1706845Z E ^ 2025-05-07T20:32:19.1707550Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1707575Z 2025-05-07T20:32:19.1708391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1708401Z 2025-05-07T20:32:19.1708607Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1709061Z self=, 2025-05-07T20:32:19.1709218Z T=2048, 2025-05-07T20:32:19.1709370Z D=7168, 2025-05-07T20:32:19.1709543Z scale_ub=None, 2025-05-07T20:32:19.1709717Z contiguous=False, 2025-05-07T20:32:19.1709890Z compiled=True, 2025-05-07T20:32:19.1710048Z ) 2025-05-07T20:32:19.1710384Z self = 2025-05-07T20:32:19.1710588Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.1710592Z 2025-05-07T20:32:19.1710688Z @given( 2025-05-07T20:32:19.1710815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1710927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1711044Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1711161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1711280Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1711363Z ) 2025-05-07T20:32:19.1711613Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1711708Z def test_silu_mul_quant( 2025-05-07T20:32:19.1711789Z self, 2025-05-07T20:32:19.1711878Z T: int, 2025-05-07T20:32:19.1711972Z D: int, 2025-05-07T20:32:19.1712071Z scale_ub: Optional[float], 2025-05-07T20:32:19.1712168Z contiguous: bool, 2025-05-07T20:32:19.1712254Z compiled: bool, 2025-05-07T20:32:19.1712336Z ) -> None: 2025-05-07T20:32:19.1712439Z torch.manual_seed(2025) 2025-05-07T20:32:19.1712512Z 2025-05-07T20:32:19.1712683Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1712850Z 2025-05-07T20:32:19.1719113Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1719274Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1719370Z x = x_sign * x_clamp 2025-05-07T20:32:19.1719463Z x0 = x[:, :D] 2025-05-07T20:32:19.1719553Z x1 = x[:, D:] 2025-05-07T20:32:19.1719631Z 2025-05-07T20:32:19.1719727Z if contiguous: 2025-05-07T20:32:19.1719820Z x0 = x0.contiguous() 2025-05-07T20:32:19.1719910Z x1 = x1.contiguous() 2025-05-07T20:32:19.1720064Z 2025-05-07T20:32:19.1720156Z if scale_ub is not None: 2025-05-07T20:32:19.1720266Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1720415Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1720494Z ) 2025-05-07T20:32:19.1720572Z else: 2025-05-07T20:32:19.1720677Z scale_ub_tensor = None 2025-05-07T20:32:19.1720754Z 2025-05-07T20:32:19.1720895Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1720994Z op = silu_mul_quant 2025-05-07T20:32:19.1721080Z if compiled: 2025-05-07T20:32:19.1721188Z op = torch.compile(op) 2025-05-07T20:32:19.1721377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1721453Z 2025-05-07T20:32:19.1721554Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1721559Z 2025-05-07T20:32:19.1721658Z moe/activation_test.py:117: 2025-05-07T20:32:19.1721792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1721905Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1722009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1722378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1722481Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1722972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1724537Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1724894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1725119Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1725462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1725561Z kernel = self.compile( 2025-05-07T20:32:19.1725949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1726124Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1726255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1726260Z 2025-05-07T20:32:19.1726471Z self = 2025-05-07T20:32:19.1727248Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1727757Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda48e6f20>} 2025-05-07T20:32:19.1728497Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1728692Z context = 2025-05-07T20:32:19.1728697Z 2025-05-07T20:32:19.1728868Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1729176Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1729294Z module_map=module_map) 2025-05-07T20:32:19.1729457Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1729562Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1729646Z E ^ 2025-05-07T20:32:19.1729996Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1730041Z 2025-05-07T20:32:19.1730455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1730460Z 2025-05-07T20:32:19.1730564Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1730784Z self=, 2025-05-07T20:32:19.1730871Z T=4096, 2025-05-07T20:32:19.1730949Z D=7168, 2025-05-07T20:32:19.1731035Z scale_ub=None, 2025-05-07T20:32:19.1731131Z contiguous=False, 2025-05-07T20:32:19.1731215Z compiled=True, 2025-05-07T20:32:19.1731292Z ) 2025-05-07T20:32:19.1731515Z self = 2025-05-07T20:32:19.1731768Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.1731773Z 2025-05-07T20:32:19.1731860Z @given( 2025-05-07T20:32:19.1731980Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1732085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1732210Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1732332Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1732446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1732532Z ) 2025-05-07T20:32:19.1732777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1732873Z def test_silu_mul_quant( 2025-05-07T20:32:19.1732966Z self, 2025-05-07T20:32:19.1733045Z T: int, 2025-05-07T20:32:19.1733133Z D: int, 2025-05-07T20:32:19.1733235Z scale_ub: Optional[float], 2025-05-07T20:32:19.1733325Z contiguous: bool, 2025-05-07T20:32:19.1733421Z compiled: bool, 2025-05-07T20:32:19.1733506Z ) -> None: 2025-05-07T20:32:19.1733603Z torch.manual_seed(2025) 2025-05-07T20:32:19.1733805Z 2025-05-07T20:32:19.1733974Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1734064Z 2025-05-07T20:32:19.1734158Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1734283Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1734381Z x = x_sign * x_clamp 2025-05-07T20:32:19.1734461Z x0 = x[:, :D] 2025-05-07T20:32:19.1734542Z x1 = x[:, D:] 2025-05-07T20:32:19.1734622Z 2025-05-07T20:32:19.1734708Z if contiguous: 2025-05-07T20:32:19.1734800Z x0 = x0.contiguous() 2025-05-07T20:32:19.1734899Z x1 = x1.contiguous() 2025-05-07T20:32:19.1734972Z 2025-05-07T20:32:19.1735064Z if scale_ub is not None: 2025-05-07T20:32:19.1735177Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1735315Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1735394Z ) 2025-05-07T20:32:19.1735480Z else: 2025-05-07T20:32:19.1735575Z scale_ub_tensor = None 2025-05-07T20:32:19.1735656Z 2025-05-07T20:32:19.1735789Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1735881Z op = silu_mul_quant 2025-05-07T20:32:19.1735972Z if compiled: 2025-05-07T20:32:19.1736072Z op = torch.compile(op) 2025-05-07T20:32:19.1736179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1736262Z 2025-05-07T20:32:19.1736353Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1736358Z 2025-05-07T20:32:19.1736510Z moe/activation_test.py:117: 2025-05-07T20:32:19.1736649Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1736749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1736856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1737224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1737318Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1737815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1737959Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1738312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1738538Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1738875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1738978Z kernel = self.compile( 2025-05-07T20:32:19.1739358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1739637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1739772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1739777Z 2025-05-07T20:32:19.1739982Z self = 2025-05-07T20:32:19.1740809Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1741306Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df00e0>} 2025-05-07T20:32:19.1742043Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1742245Z context = 2025-05-07T20:32:19.1742250Z 2025-05-07T20:32:19.1742412Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1742679Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1742791Z module_map=module_map) 2025-05-07T20:32:19.1742954Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1743060Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1743138Z E ^ 2025-05-07T20:32:19.1743495Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1743502Z 2025-05-07T20:32:19.1743907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1743912Z 2025-05-07T20:32:19.1744021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1744248Z self=, 2025-05-07T20:32:19.1744328Z T=16384, 2025-05-07T20:32:19.1744407Z D=5120, 2025-05-07T20:32:19.1744503Z scale_ub=1200.0, 2025-05-07T20:32:19.1744593Z contiguous=False, 2025-05-07T20:32:19.1744686Z compiled=False, 2025-05-07T20:32:19.1744766Z ) 2025-05-07T20:32:19.1744983Z self = 2025-05-07T20:32:19.1745169Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:19.1745173Z 2025-05-07T20:32:19.1745253Z @given( 2025-05-07T20:32:19.1745377Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1745531Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1745649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1745769Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1745893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1745969Z ) 2025-05-07T20:32:19.1746220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1746313Z def test_silu_mul_quant( 2025-05-07T20:32:19.1746431Z self, 2025-05-07T20:32:19.1746516Z T: int, 2025-05-07T20:32:19.1746593Z D: int, 2025-05-07T20:32:19.1746692Z scale_ub: Optional[float], 2025-05-07T20:32:19.1746794Z contiguous: bool, 2025-05-07T20:32:19.1746882Z compiled: bool, 2025-05-07T20:32:19.1746962Z ) -> None: 2025-05-07T20:32:19.1747064Z torch.manual_seed(2025) 2025-05-07T20:32:19.1747137Z 2025-05-07T20:32:19.1747311Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1747393Z 2025-05-07T20:32:19.1747486Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1747619Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1747785Z x = x_sign * x_clamp 2025-05-07T20:32:19.1747867Z x0 = x[:, :D] 2025-05-07T20:32:19.1747961Z x1 = x[:, D:] 2025-05-07T20:32:19.1748035Z 2025-05-07T20:32:19.1748120Z if contiguous: 2025-05-07T20:32:19.1748219Z x0 = x0.contiguous() 2025-05-07T20:32:19.1748312Z x1 = x1.contiguous() 2025-05-07T20:32:19.1748386Z 2025-05-07T20:32:19.1748485Z if scale_ub is not None: 2025-05-07T20:32:19.1748593Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1748727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1748811Z ) 2025-05-07T20:32:19.1748888Z else: 2025-05-07T20:32:19.1748983Z scale_ub_tensor = None 2025-05-07T20:32:19.1749068Z 2025-05-07T20:32:19.1749196Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1749295Z op = silu_mul_quant 2025-05-07T20:32:19.1749380Z if compiled: 2025-05-07T20:32:19.1749486Z op = torch.compile(op) 2025-05-07T20:32:19.1749601Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1749674Z 2025-05-07T20:32:19.1749765Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1749770Z 2025-05-07T20:32:19.1749877Z moe/activation_test.py:117: 2025-05-07T20:32:19.1750008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1750108Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1750215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1750704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:19.1750813Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1751171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1751393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1751744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1751838Z kernel = self.compile( 2025-05-07T20:32:19.1752225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1752403Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1752532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1752536Z 2025-05-07T20:32:19.1752751Z self = 2025-05-07T20:32:19.1753512Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1754071Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df0b80>} 2025-05-07T20:32:19.1754806Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1755038Z context = 2025-05-07T20:32:19.1755043Z 2025-05-07T20:32:19.1755215Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1755471Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1755591Z module_map=module_map) 2025-05-07T20:32:19.1755753Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1755851Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1755937Z E ^ 2025-05-07T20:32:19.1756358Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1756363Z 2025-05-07T20:32:19.1756769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1756783Z 2025-05-07T20:32:19.1756888Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1757109Z self=, 2025-05-07T20:32:19.1757194Z T=16384, 2025-05-07T20:32:19.1757272Z D=5120, 2025-05-07T20:32:19.1757358Z scale_ub=1200.0, 2025-05-07T20:32:19.1757450Z contiguous=True, 2025-05-07T20:32:19.1757535Z compiled=True, 2025-05-07T20:32:19.1757612Z ) 2025-05-07T20:32:19.1757835Z self = 2025-05-07T20:32:19.1758005Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:19.1758009Z 2025-05-07T20:32:19.1758091Z @given( 2025-05-07T20:32:19.1758216Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1758315Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1758438Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1758558Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1758674Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1758756Z ) 2025-05-07T20:32:19.1758998Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1759092Z def test_silu_mul_quant( 2025-05-07T20:32:19.1759174Z self, 2025-05-07T20:32:19.1759252Z T: int, 2025-05-07T20:32:19.1759334Z D: int, 2025-05-07T20:32:19.1759438Z scale_ub: Optional[float], 2025-05-07T20:32:19.1759529Z contiguous: bool, 2025-05-07T20:32:19.1759623Z compiled: bool, 2025-05-07T20:32:19.1759702Z ) -> None: 2025-05-07T20:32:19.1759797Z torch.manual_seed(2025) 2025-05-07T20:32:19.1759881Z 2025-05-07T20:32:19.1760050Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1760125Z 2025-05-07T20:32:19.1760223Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1760353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1760462Z x = x_sign * x_clamp 2025-05-07T20:32:19.1760562Z x0 = x[:, :D] 2025-05-07T20:32:19.1760661Z x1 = x[:, D:] 2025-05-07T20:32:19.1760737Z 2025-05-07T20:32:19.1760827Z if contiguous: 2025-05-07T20:32:19.1760919Z x0 = x0.contiguous() 2025-05-07T20:32:19.1761015Z x1 = x1.contiguous() 2025-05-07T20:32:19.1761087Z 2025-05-07T20:32:19.1761228Z if scale_ub is not None: 2025-05-07T20:32:19.1761339Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1761473Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1761550Z ) 2025-05-07T20:32:19.1761636Z else: 2025-05-07T20:32:19.1761733Z scale_ub_tensor = None 2025-05-07T20:32:19.1761807Z 2025-05-07T20:32:19.1761944Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1762034Z op = silu_mul_quant 2025-05-07T20:32:19.1762164Z if compiled: 2025-05-07T20:32:19.1762274Z op = torch.compile(op) 2025-05-07T20:32:19.1762380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1762461Z 2025-05-07T20:32:19.1762553Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1762557Z 2025-05-07T20:32:19.1762654Z moe/activation_test.py:117: 2025-05-07T20:32:19.1762790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1762894Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1762997Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1763366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1763535Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1764030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1764128Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1764488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1764718Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1765057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1765153Z kernel = self.compile( 2025-05-07T20:32:19.1765544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1765717Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1765857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1765862Z 2025-05-07T20:32:19.1766066Z self = 2025-05-07T20:32:19.1766828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1767341Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df22a0>} 2025-05-07T20:32:19.1768074Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1768278Z context = 2025-05-07T20:32:19.1768282Z 2025-05-07T20:32:19.1768455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1768714Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1768833Z module_map=module_map) 2025-05-07T20:32:19.1768998Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1769104Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1769186Z E ^ 2025-05-07T20:32:19.1769537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1769542Z 2025-05-07T20:32:19.1769955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1770004Z 2025-05-07T20:32:19.1770111Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1770346Z self=, 2025-05-07T20:32:19.1770427Z T=16384, 2025-05-07T20:32:19.1770528Z D=5120, 2025-05-07T20:32:19.1770626Z scale_ub=None, 2025-05-07T20:32:19.1770735Z contiguous=False, 2025-05-07T20:32:19.1770821Z compiled=True, 2025-05-07T20:32:19.1770942Z ) 2025-05-07T20:32:19.1771159Z self = 2025-05-07T20:32:19.1771331Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.1771336Z 2025-05-07T20:32:19.1771421Z @given( 2025-05-07T20:32:19.1771540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1771648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1771768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1771887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1772009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1772086Z ) 2025-05-07T20:32:19.1772428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1772531Z def test_silu_mul_quant( 2025-05-07T20:32:19.1772611Z self, 2025-05-07T20:32:19.1772689Z T: int, 2025-05-07T20:32:19.1772777Z D: int, 2025-05-07T20:32:19.1772875Z scale_ub: Optional[float], 2025-05-07T20:32:19.1772965Z contiguous: bool, 2025-05-07T20:32:19.1773058Z compiled: bool, 2025-05-07T20:32:19.1773136Z ) -> None: 2025-05-07T20:32:19.1773241Z torch.manual_seed(2025) 2025-05-07T20:32:19.1773313Z 2025-05-07T20:32:19.1773481Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1773562Z 2025-05-07T20:32:19.1773726Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1773852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1773947Z x = x_sign * x_clamp 2025-05-07T20:32:19.1774028Z x0 = x[:, :D] 2025-05-07T20:32:19.1774105Z x1 = x[:, D:] 2025-05-07T20:32:19.1774191Z 2025-05-07T20:32:19.1774272Z if contiguous: 2025-05-07T20:32:19.1774361Z x0 = x0.contiguous() 2025-05-07T20:32:19.1774455Z x1 = x1.contiguous() 2025-05-07T20:32:19.1774526Z 2025-05-07T20:32:19.1774626Z if scale_ub is not None: 2025-05-07T20:32:19.1774727Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1774859Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1774936Z ) 2025-05-07T20:32:19.1775009Z else: 2025-05-07T20:32:19.1775102Z scale_ub_tensor = None 2025-05-07T20:32:19.1775181Z 2025-05-07T20:32:19.1775311Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1775402Z op = silu_mul_quant 2025-05-07T20:32:19.1775489Z if compiled: 2025-05-07T20:32:19.1775589Z op = torch.compile(op) 2025-05-07T20:32:19.1775694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1775776Z 2025-05-07T20:32:19.1775870Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1775874Z 2025-05-07T20:32:19.1775975Z moe/activation_test.py:117: 2025-05-07T20:32:19.1776101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1776201Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1776304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1776660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1776752Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1777243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1777389Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1777749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1777970Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1778303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1778402Z kernel = self.compile( 2025-05-07T20:32:19.1778822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1778994Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1779127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1779131Z 2025-05-07T20:32:19.1779338Z self = 2025-05-07T20:32:19.1780101Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1780721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df3060>} 2025-05-07T20:32:19.1781458Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1781658Z context = 2025-05-07T20:32:19.1781662Z 2025-05-07T20:32:19.1781824Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1782077Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1782194Z module_map=module_map) 2025-05-07T20:32:19.1782353Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1782450Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1782537Z E ^ 2025-05-07T20:32:19.1782884Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1782889Z 2025-05-07T20:32:19.1783299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1783306Z 2025-05-07T20:32:19.1783408Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1783623Z self=, 2025-05-07T20:32:19.1783709Z T=2048, 2025-05-07T20:32:19.1783781Z D=5120, 2025-05-07T20:32:19.1783860Z scale_ub=None, 2025-05-07T20:32:19.1783953Z contiguous=False, 2025-05-07T20:32:19.1784035Z compiled=True, 2025-05-07T20:32:19.1784113Z ) 2025-05-07T20:32:19.1784327Z self = 2025-05-07T20:32:19.1784496Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.1784506Z 2025-05-07T20:32:19.1784587Z @given( 2025-05-07T20:32:19.1784706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1784803Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1784925Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1785041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1785151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1785235Z ) 2025-05-07T20:32:19.1785475Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1785572Z def test_silu_mul_quant( 2025-05-07T20:32:19.1785647Z self, 2025-05-07T20:32:19.1785769Z T: int, 2025-05-07T20:32:19.1785848Z D: int, 2025-05-07T20:32:19.1785946Z scale_ub: Optional[float], 2025-05-07T20:32:19.1786032Z contiguous: bool, 2025-05-07T20:32:19.1786122Z compiled: bool, 2025-05-07T20:32:19.1786197Z ) -> None: 2025-05-07T20:32:19.1786292Z torch.manual_seed(2025) 2025-05-07T20:32:19.1786369Z 2025-05-07T20:32:19.1786532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1786602Z 2025-05-07T20:32:19.1786736Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1786856Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1786948Z x = x_sign * x_clamp 2025-05-07T20:32:19.1787027Z x0 = x[:, :D] 2025-05-07T20:32:19.1787103Z x1 = x[:, D:] 2025-05-07T20:32:19.1787182Z 2025-05-07T20:32:19.1787262Z if contiguous: 2025-05-07T20:32:19.1787350Z x0 = x0.contiguous() 2025-05-07T20:32:19.1787442Z x1 = x1.contiguous() 2025-05-07T20:32:19.1787515Z 2025-05-07T20:32:19.1787602Z if scale_ub is not None: 2025-05-07T20:32:19.1787712Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1787842Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1787989Z ) 2025-05-07T20:32:19.1788074Z else: 2025-05-07T20:32:19.1788166Z scale_ub_tensor = None 2025-05-07T20:32:19.1788238Z 2025-05-07T20:32:19.1788373Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1788465Z op = silu_mul_quant 2025-05-07T20:32:19.1788553Z if compiled: 2025-05-07T20:32:19.1788650Z op = torch.compile(op) 2025-05-07T20:32:19.1788754Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1788833Z 2025-05-07T20:32:19.1788922Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1788927Z 2025-05-07T20:32:19.1789020Z moe/activation_test.py:117: 2025-05-07T20:32:19.1789154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1789250Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1789344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1789715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1789807Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1790297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1790395Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1790795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1791017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1791350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1791450Z kernel = self.compile( 2025-05-07T20:32:19.1791826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1792000Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1792138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1792142Z 2025-05-07T20:32:19.1792344Z self = 2025-05-07T20:32:19.1793112Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1793607Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda51507c0>} 2025-05-07T20:32:19.1794380Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1794582Z context = 2025-05-07T20:32:19.1794587Z 2025-05-07T20:32:19.1794750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1795009Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1795155Z module_map=module_map) 2025-05-07T20:32:19.1795312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1795415Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1795493Z E ^ 2025-05-07T20:32:19.1795838Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1795852Z 2025-05-07T20:32:19.1796256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1796261Z 2025-05-07T20:32:19.1796361Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1796655Z self=, 2025-05-07T20:32:19.1796735Z T=2048, 2025-05-07T20:32:19.1796811Z D=5120, 2025-05-07T20:32:19.1796902Z scale_ub=1200.0, 2025-05-07T20:32:19.1796990Z contiguous=False, 2025-05-07T20:32:19.1797073Z compiled=True, 2025-05-07T20:32:19.1797154Z ) 2025-05-07T20:32:19.1797368Z self = 2025-05-07T20:32:19.1797544Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:19.1797548Z 2025-05-07T20:32:19.1797622Z @given( 2025-05-07T20:32:19.1797743Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1797849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1797961Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1798075Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1798423Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1798541Z ) 2025-05-07T20:32:19.1798792Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1798892Z def test_silu_mul_quant( 2025-05-07T20:32:19.1798968Z self, 2025-05-07T20:32:19.1799050Z T: int, 2025-05-07T20:32:19.1799123Z D: int, 2025-05-07T20:32:19.1799218Z scale_ub: Optional[float], 2025-05-07T20:32:19.1799314Z contiguous: bool, 2025-05-07T20:32:19.1799395Z compiled: bool, 2025-05-07T20:32:19.1799656Z ) -> None: 2025-05-07T20:32:19.1799756Z torch.manual_seed(2025) 2025-05-07T20:32:19.1799827Z 2025-05-07T20:32:19.1799992Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1800072Z 2025-05-07T20:32:19.1800160Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1800283Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1800375Z x = x_sign * x_clamp 2025-05-07T20:32:19.1800451Z x0 = x[:, :D] 2025-05-07T20:32:19.1800537Z x1 = x[:, D:] 2025-05-07T20:32:19.1800609Z 2025-05-07T20:32:19.1800690Z if contiguous: 2025-05-07T20:32:19.1800783Z x0 = x0.contiguous() 2025-05-07T20:32:19.1800870Z x1 = x1.contiguous() 2025-05-07T20:32:19.1800940Z 2025-05-07T20:32:19.1801033Z if scale_ub is not None: 2025-05-07T20:32:19.1801137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1801265Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1801345Z ) 2025-05-07T20:32:19.1801417Z else: 2025-05-07T20:32:19.1801507Z scale_ub_tensor = None 2025-05-07T20:32:19.1801581Z 2025-05-07T20:32:19.1801805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1801890Z op = silu_mul_quant 2025-05-07T20:32:19.1801977Z if compiled: 2025-05-07T20:32:19.1802073Z op = torch.compile(op) 2025-05-07T20:32:19.1802188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1802256Z 2025-05-07T20:32:19.1802345Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1802350Z 2025-05-07T20:32:19.1802448Z moe/activation_test.py:117: 2025-05-07T20:32:19.1802665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1802760Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1802864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1803223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1803319Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1803802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1803901Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1804256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1804586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1804921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1805020Z kernel = self.compile( 2025-05-07T20:32:19.1805394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1805570Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1805695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1805699Z 2025-05-07T20:32:19.1805902Z self = 2025-05-07T20:32:19.1806673Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1807167Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5151580>} 2025-05-07T20:32:19.1807904Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1808093Z context = 2025-05-07T20:32:19.1808098Z 2025-05-07T20:32:19.1808267Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1808522Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1808629Z module_map=module_map) 2025-05-07T20:32:19.1808792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1808892Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1808970Z E ^ 2025-05-07T20:32:19.1809321Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1809329Z 2025-05-07T20:32:19.1809733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1809738Z 2025-05-07T20:32:19.1809843Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1810060Z self=, 2025-05-07T20:32:19.1810133Z T=4096, 2025-05-07T20:32:19.1810212Z D=5120, 2025-05-07T20:32:19.1810341Z scale_ub=1200.0, 2025-05-07T20:32:19.1810424Z contiguous=True, 2025-05-07T20:32:19.1810523Z compiled=True, 2025-05-07T20:32:19.1810607Z ) 2025-05-07T20:32:19.1810848Z self = 2025-05-07T20:32:19.1811024Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:19.1811029Z 2025-05-07T20:32:19.1811103Z @given( 2025-05-07T20:32:19.1811226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1811364Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1811479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1811600Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1811709Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1811780Z ) 2025-05-07T20:32:19.1812026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1812121Z def test_silu_mul_quant( 2025-05-07T20:32:19.1812197Z self, 2025-05-07T20:32:19.1812279Z T: int, 2025-05-07T20:32:19.1812354Z D: int, 2025-05-07T20:32:19.1812455Z scale_ub: Optional[float], 2025-05-07T20:32:19.1812541Z contiguous: bool, 2025-05-07T20:32:19.1812704Z compiled: bool, 2025-05-07T20:32:19.1812785Z ) -> None: 2025-05-07T20:32:19.1812875Z torch.manual_seed(2025) 2025-05-07T20:32:19.1812943Z 2025-05-07T20:32:19.1813113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1813188Z 2025-05-07T20:32:19.1813277Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1813406Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1813492Z x = x_sign * x_clamp 2025-05-07T20:32:19.1813569Z x0 = x[:, :D] 2025-05-07T20:32:19.1813735Z x1 = x[:, D:] 2025-05-07T20:32:19.1813807Z 2025-05-07T20:32:19.1813895Z if contiguous: 2025-05-07T20:32:19.1813989Z x0 = x0.contiguous() 2025-05-07T20:32:19.1814074Z x1 = x1.contiguous() 2025-05-07T20:32:19.1814156Z 2025-05-07T20:32:19.1814244Z if scale_ub is not None: 2025-05-07T20:32:19.1814347Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1814490Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1814562Z ) 2025-05-07T20:32:19.1814637Z else: 2025-05-07T20:32:19.1814733Z scale_ub_tensor = None 2025-05-07T20:32:19.1814803Z 2025-05-07T20:32:19.1814932Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1815022Z op = silu_mul_quant 2025-05-07T20:32:19.1815104Z if compiled: 2025-05-07T20:32:19.1815201Z op = torch.compile(op) 2025-05-07T20:32:19.1815309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1815380Z 2025-05-07T20:32:19.1815474Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1815479Z 2025-05-07T20:32:19.1815579Z moe/activation_test.py:117: 2025-05-07T20:32:19.1815704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1815807Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1815903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1816267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1816364Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1816847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1816948Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1817300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1817518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1817854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1817999Z kernel = self.compile( 2025-05-07T20:32:19.1818376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1818556Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1818680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1818685Z 2025-05-07T20:32:19.1818937Z self = 2025-05-07T20:32:19.1819696Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1820195Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5152840>} 2025-05-07T20:32:19.1821063Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1821255Z context = 2025-05-07T20:32:19.1821260Z 2025-05-07T20:32:19.1821432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1821692Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1821802Z module_map=module_map) 2025-05-07T20:32:19.1821963Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1822061Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1822145Z E ^ 2025-05-07T20:32:19.1822489Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1822498Z 2025-05-07T20:32:19.1822901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1822912Z 2025-05-07T20:32:19.1823016Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1823234Z self=, 2025-05-07T20:32:19.1823317Z T=128, 2025-05-07T20:32:19.1823393Z D=5120, 2025-05-07T20:32:19.1823478Z scale_ub=1200.0, 2025-05-07T20:32:19.1823567Z contiguous=False, 2025-05-07T20:32:19.1823651Z compiled=True, 2025-05-07T20:32:19.1823727Z ) 2025-05-07T20:32:19.1823948Z self = 2025-05-07T20:32:19.1824114Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:19.1824118Z 2025-05-07T20:32:19.1824199Z @given( 2025-05-07T20:32:19.1824319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1824416Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1824537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1824653Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1824766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1824845Z ) 2025-05-07T20:32:19.1825085Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1825174Z def test_silu_mul_quant( 2025-05-07T20:32:19.1825258Z self, 2025-05-07T20:32:19.1825332Z T: int, 2025-05-07T20:32:19.1825405Z D: int, 2025-05-07T20:32:19.1825512Z scale_ub: Optional[float], 2025-05-07T20:32:19.1825599Z contiguous: bool, 2025-05-07T20:32:19.1825688Z compiled: bool, 2025-05-07T20:32:19.1825764Z ) -> None: 2025-05-07T20:32:19.1825856Z torch.manual_seed(2025) 2025-05-07T20:32:19.1825936Z 2025-05-07T20:32:19.1826151Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1826225Z 2025-05-07T20:32:19.1826322Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1826444Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1826533Z x = x_sign * x_clamp 2025-05-07T20:32:19.1826619Z x0 = x[:, :D] 2025-05-07T20:32:19.1826695Z x1 = x[:, D:] 2025-05-07T20:32:19.1826767Z 2025-05-07T20:32:19.1826855Z if contiguous: 2025-05-07T20:32:19.1826984Z x0 = x0.contiguous() 2025-05-07T20:32:19.1827073Z x1 = x1.contiguous() 2025-05-07T20:32:19.1827157Z 2025-05-07T20:32:19.1827247Z if scale_ub is not None: 2025-05-07T20:32:19.1827355Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1827485Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1827561Z ) 2025-05-07T20:32:19.1827642Z else: 2025-05-07T20:32:19.1827737Z scale_ub_tensor = None 2025-05-07T20:32:19.1827806Z 2025-05-07T20:32:19.1827939Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1828028Z op = silu_mul_quant 2025-05-07T20:32:19.1828112Z if compiled: 2025-05-07T20:32:19.1828296Z op = torch.compile(op) 2025-05-07T20:32:19.1828401Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1828473Z 2025-05-07T20:32:19.1828568Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1828573Z 2025-05-07T20:32:19.1828673Z moe/activation_test.py:117: 2025-05-07T20:32:19.1828806Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1828904Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1829003Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1829367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1829457Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1829942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1830043Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1830399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1830623Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1830955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1831047Z kernel = self.compile( 2025-05-07T20:32:19.1831429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1831599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1831729Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1831736Z 2025-05-07T20:32:19.1831939Z self = 2025-05-07T20:32:19.1832705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1833206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda51534c0>} 2025-05-07T20:32:19.1833936Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1834133Z context = 2025-05-07T20:32:19.1834137Z 2025-05-07T20:32:19.1834371Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1834626Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1834736Z module_map=module_map) 2025-05-07T20:32:19.1834898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1834998Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1835071Z E ^ 2025-05-07T20:32:19.1835419Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1835464Z 2025-05-07T20:32:19.1835876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1835881Z 2025-05-07T20:32:19.1835984Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1836208Z self=, 2025-05-07T20:32:19.1836289Z T=16384, 2025-05-07T20:32:19.1836365Z D=7168, 2025-05-07T20:32:19.1836452Z scale_ub=1200.0, 2025-05-07T20:32:19.1836535Z contiguous=True, 2025-05-07T20:32:19.1836617Z compiled=True, 2025-05-07T20:32:19.1836697Z ) 2025-05-07T20:32:19.1836983Z self = 2025-05-07T20:32:19.1837155Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:19.1837160Z 2025-05-07T20:32:19.1837237Z @given( 2025-05-07T20:32:19.1837356Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1837458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1837569Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1837681Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1837794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1837866Z ) 2025-05-07T20:32:19.1838105Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1838202Z def test_silu_mul_quant( 2025-05-07T20:32:19.1838275Z self, 2025-05-07T20:32:19.1838349Z T: int, 2025-05-07T20:32:19.1838425Z D: int, 2025-05-07T20:32:19.1838520Z scale_ub: Optional[float], 2025-05-07T20:32:19.1838609Z contiguous: bool, 2025-05-07T20:32:19.1838696Z compiled: bool, 2025-05-07T20:32:19.1838770Z ) -> None: 2025-05-07T20:32:19.1838866Z torch.manual_seed(2025) 2025-05-07T20:32:19.1838935Z 2025-05-07T20:32:19.1839101Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1839177Z 2025-05-07T20:32:19.1839269Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1839389Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1839479Z x = x_sign * x_clamp 2025-05-07T20:32:19.1839557Z x0 = x[:, :D] 2025-05-07T20:32:19.1839634Z x1 = x[:, D:] 2025-05-07T20:32:19.1839713Z 2025-05-07T20:32:19.1839796Z if contiguous: 2025-05-07T20:32:19.1839884Z x0 = x0.contiguous() 2025-05-07T20:32:19.1839972Z x1 = x1.contiguous() 2025-05-07T20:32:19.1844441Z 2025-05-07T20:32:19.1844559Z if scale_ub is not None: 2025-05-07T20:32:19.1844677Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1844813Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1844893Z ) 2025-05-07T20:32:19.1844969Z else: 2025-05-07T20:32:19.1845067Z scale_ub_tensor = None 2025-05-07T20:32:19.1845141Z 2025-05-07T20:32:19.1845272Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1845366Z op = silu_mul_quant 2025-05-07T20:32:19.1845451Z if compiled: 2025-05-07T20:32:19.1845550Z op = torch.compile(op) 2025-05-07T20:32:19.1845658Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1845730Z 2025-05-07T20:32:19.1845890Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1845894Z 2025-05-07T20:32:19.1845996Z moe/activation_test.py:117: 2025-05-07T20:32:19.1846125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1846229Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1846333Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1846703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1846800Z return fn(*args, **kwargs) 
Hypothesis keeps sampling, and the next five examples fail with the identical fp8e4nv CompilationError (tracebacks elided):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError (fp8e4nv)

The failure mode then changes: with nearly all of the card's 22.07 GiB already allocated, even modest temporaries no longer fit.

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    [decorators and test body as in the listing above]
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
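The sizes the allocator reports line up exactly with one [T, 2*D] bfloat16 tensor at 2 bytes per element, so each failing line (randn, sign, abs, clamp) is simply the next full-size temporary; the pool appears exhausted by allocations accumulated earlier in the process rather than by any single tensor. A quick back-of-the-envelope check (assumptions: bf16 = 2 bytes/element, 1 MiB = 2**20 bytes):

    def tensor_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
        # Size of one [T, 2 * D] bf16 tensor, in MiB.
        return T * 2 * D * bytes_per_elem / 2**20

    assert tensor_mib(16384, 5120) == 320.0  # the 320.00 MiB failure above
    assert tensor_mib(16384, 7168) == 448.0  # the 448.00 MiB randn failure below
    assert tensor_mib(4096, 7168) == 112.0   # and the remaining OOM sizes
    assert tensor_mib(2048, 7168) == 56.0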
Subsequent examples hit the same allocator wall at whichever line first needs a fresh [T, 2*D] temporary (full tracebacks elided; only the failing line and requested size differ):

Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> OutOfMemoryError at x_clamp (moe/activation_test.py:95), tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=False) -> OutOfMemoryError at torch.randn (moe/activation_test.py:92), tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> OutOfMemoryError at x_clamp (moe/activation_test.py:95), tried to allocate 56.00 MiB
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> OutOfMemoryError at x_sign (moe/activation_test.py:94), tried to allocate 56.00 MiB
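Each OutOfMemoryError message points at the same mitigation. One possible wiring, assuming it runs before CUDA is first initialized (for example in the job environment or a conftest.py; neither is something this workflow currently does):

    import gc
    import os

    # Must be set before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references, then hand cached blocks back to
        # the allocator so the next Hypothesis example starts from a clean pool.
        gc.collect()
        torch.cuda.empty_cache()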
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1954440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1954657Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1954995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1955088Z kernel = self.compile( 2025-05-07T20:32:19.1955469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1955677Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1955804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1955809Z 2025-05-07T20:32:19.1956016Z self = 2025-05-07T20:32:19.1956779Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1957281Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e4b80>} 2025-05-07T20:32:19.1958017Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1958272Z context = 2025-05-07T20:32:19.1958282Z 2025-05-07T20:32:19.1958442Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1958703Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1958814Z module_map=module_map) 2025-05-07T20:32:19.1958972Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1959067Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1959145Z E ^ 2025-05-07T20:32:19.1959490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1959497Z 2025-05-07T20:32:19.1959908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1959912Z 2025-05-07T20:32:19.1960018Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1960237Z self=, 2025-05-07T20:32:19.1960318Z T=128, 2025-05-07T20:32:19.1960408Z D=5120, 2025-05-07T20:32:19.1960500Z scale_ub=None, 2025-05-07T20:32:19.1960605Z contiguous=True, 2025-05-07T20:32:19.1960693Z compiled=False, 2025-05-07T20:32:19.1960762Z ) 2025-05-07T20:32:19.1960981Z self = 2025-05-07T20:32:19.1961144Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:19.1961148Z 2025-05-07T20:32:19.1961226Z @given( 2025-05-07T20:32:19.1961386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1961484Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1961599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1961713Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1961830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1961906Z ) 2025-05-07T20:32:19.1962146Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1962239Z def test_silu_mul_quant( 2025-05-07T20:32:19.1962357Z self, 2025-05-07T20:32:19.1962431Z T: int, 2025-05-07T20:32:19.1962508Z D: int, 2025-05-07T20:32:19.1962603Z scale_ub: Optional[float], 2025-05-07T20:32:19.1962689Z contiguous: bool, 2025-05-07T20:32:19.1962775Z compiled: bool, 2025-05-07T20:32:19.1962853Z ) -> None: 2025-05-07T20:32:19.1962945Z torch.manual_seed(2025) 2025-05-07T20:32:19.1963018Z 2025-05-07T20:32:19.1963183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1963254Z 2025-05-07T20:32:19.1963348Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1963470Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1963593Z x = x_sign * x_clamp 2025-05-07T20:32:19.1963677Z x0 = x[:, :D] 2025-05-07T20:32:19.1963757Z x1 = x[:, D:] 2025-05-07T20:32:19.1963831Z 2025-05-07T20:32:19.1963912Z if contiguous: 2025-05-07T20:32:19.1964000Z x0 = x0.contiguous() 2025-05-07T20:32:19.1964095Z x1 = x1.contiguous() 2025-05-07T20:32:19.1964169Z 2025-05-07T20:32:19.1964256Z if scale_ub is not None: 2025-05-07T20:32:19.1964361Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1964491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1964564Z ) 2025-05-07T20:32:19.1964642Z else: 2025-05-07T20:32:19.1964734Z scale_ub_tensor = None 2025-05-07T20:32:19.1964807Z 2025-05-07T20:32:19.1964936Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1969179Z op = silu_mul_quant 2025-05-07T20:32:19.1969287Z if compiled: 2025-05-07T20:32:19.1969395Z op = torch.compile(op) 2025-05-07T20:32:19.1969577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1969649Z 2025-05-07T20:32:19.1969738Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1969743Z 2025-05-07T20:32:19.1969843Z moe/activation_test.py:117: 2025-05-07T20:32:19.1969972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1970070Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1970171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1970668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1970767Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1971122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1971339Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1971680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1971775Z kernel = self.compile( 2025-05-07T20:32:19.1972150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1972327Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1972453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1972457Z 2025-05-07T20:32:19.1972669Z self = 2025-05-07T20:32:19.1973433Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1974091Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e5a80>} 2025-05-07T20:32:19.1974827Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1975060Z context = 2025-05-07T20:32:19.1975064Z 2025-05-07T20:32:19.1975229Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1975486Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1975599Z module_map=module_map) 2025-05-07T20:32:19.1975760Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1975856Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1975934Z E ^ 2025-05-07T20:32:19.1976326Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1976332Z 2025-05-07T20:32:19.1976736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1976743Z 2025-05-07T20:32:19.1976846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1977063Z self=, 2025-05-07T20:32:19.1977140Z T=128, 2025-05-07T20:32:19.1977216Z D=7168, 2025-05-07T20:32:19.1977295Z scale_ub=None, 2025-05-07T20:32:19.1977379Z contiguous=True, 2025-05-07T20:32:19.1977460Z compiled=False, 2025-05-07T20:32:19.1977533Z ) 2025-05-07T20:32:19.1977749Z self = 2025-05-07T20:32:19.1977912Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:19.1977917Z 2025-05-07T20:32:19.1977991Z @given( 2025-05-07T20:32:19.1978113Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1978255Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1978371Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1978493Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1978604Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1978681Z ) 2025-05-07T20:32:19.1978920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1979016Z def test_silu_mul_quant( 2025-05-07T20:32:19.1979094Z self, 2025-05-07T20:32:19.1979170Z T: int, 2025-05-07T20:32:19.1979251Z D: int, 2025-05-07T20:32:19.1979350Z scale_ub: Optional[float], 2025-05-07T20:32:19.1979439Z contiguous: bool, 2025-05-07T20:32:19.1979523Z compiled: bool, 2025-05-07T20:32:19.1979605Z ) -> None: 2025-05-07T20:32:19.1979697Z torch.manual_seed(2025) 2025-05-07T20:32:19.1979770Z 2025-05-07T20:32:19.1979944Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1980016Z 2025-05-07T20:32:19.1980114Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1980242Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1980328Z x = x_sign * x_clamp 2025-05-07T20:32:19.1980409Z x0 = x[:, :D] 2025-05-07T20:32:19.1980487Z x1 = x[:, D:] 2025-05-07T20:32:19.1980557Z 2025-05-07T20:32:19.1980643Z if contiguous: 2025-05-07T20:32:19.1980733Z x0 = x0.contiguous() 2025-05-07T20:32:19.1980819Z x1 = x1.contiguous() 2025-05-07T20:32:19.1980891Z 2025-05-07T20:32:19.1981028Z if scale_ub is not None: 2025-05-07T20:32:19.1981134Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1981271Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1981344Z ) 2025-05-07T20:32:19.1981424Z else: 2025-05-07T20:32:19.1981520Z scale_ub_tensor = None 2025-05-07T20:32:19.1981591Z 2025-05-07T20:32:19.1981720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1981812Z op = silu_mul_quant 2025-05-07T20:32:19.1981941Z if compiled: 2025-05-07T20:32:19.1982039Z op = torch.compile(op) 2025-05-07T20:32:19.1982141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1982215Z 2025-05-07T20:32:19.1982303Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1982307Z 2025-05-07T20:32:19.1982405Z moe/activation_test.py:117: 2025-05-07T20:32:19.1982530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1982630Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1982729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1983215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1983351Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1983713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1983929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1984270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1984360Z kernel = self.compile( 2025-05-07T20:32:19.1984739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1984914Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1985042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1985046Z 2025-05-07T20:32:19.1985251Z self = 2025-05-07T20:32:19.1986058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1986558Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e6980>} 2025-05-07T20:32:19.1987289Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1987483Z context = 2025-05-07T20:32:19.1987488Z 2025-05-07T20:32:19.1987651Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1987909Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1988017Z module_map=module_map) 2025-05-07T20:32:19.1988179Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1988276Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1988353Z E ^ 2025-05-07T20:32:19.1988704Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:19.1989215Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 56.00 MiB, GPU 0 has 26.44 MiB free, 21.69 GiB allocated by PyTorch (see https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:19.1994406Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError at moe/activation_test.py:117 in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:19.2007270Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 40.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2012553Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 320.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2017708Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 80.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2022809Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
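Note: the requested sizes in these OutOfMemoryError lines are exactly the bf16 input tensor x of shape [T, 2 * D]: T * 2D elements at 2 bytes each. A quick check of the arithmetic against the sizes reported above and below:

    # Each failed allocation matches x = torch.randn([T, 2 * D], dtype=torch.bfloat16):
    # bytes = T * (2 * D) * 2, since bfloat16 is 2 bytes per element.
    for T, D in [(2048, 5120), (4096, 5120), (16384, 5120),
                 (2048, 7168), (4096, 7168), (16384, 7168)]:
        mib = T * (2 * D) * 2 / 2**20
        print(f"T={T:<6} D={D:<5} -> {mib:.2f} MiB")
    # -> 40.00, 80.00, 320.00, 56.00, 112.00, 448.00 MiB, matching the log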
2025-05-07T20:32:19.2027848Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2032980Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2038023Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2043114Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 448.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2048148Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2053159Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 448.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2058314Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 448.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
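Note: the free-memory figure stays pinned near 26 MiB while example after example fails, which points at the allocator holding earlier examples' tensors rather than any single request being unreasonable. The error text itself suggests one knob; a sketch of how that knob plus an explicit cache flush between Hypothesis examples might be applied (where the cleanup hook is invoked is an assumption, not FBGEMM's actual test code):

    import os

    # Must be set before the first CUDA allocation in the process, e.g. in the
    # CI job's environment rather than inside the test body.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_cache() -> None:
        # Hand cached blocks back to the driver between examples so one failed
        # example's bf16 inputs do not starve the next one.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            torch.cuda.empty_cache()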
2025-05-07T20:32:19.2063440Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> triton.compiler.errors.CompilationError at moe/activation_test.py:117 in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:19.2075772Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 56.00 MiB, 26.44 MiB free, 21.74 GiB allocated by PyTorch
2025-05-07T20:32:19.2080894Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> triton.compiler.errors.CompilationError at moe/activation_test.py:117, reached through torch/_dynamo/eval_frame.py:678 into _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:19.2097837Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(...)): tried to allocate 20.00 MiB, 4.44 MiB free, 21.77 GiB allocated by PyTorch
2025-05-07T20:32:19.2103582Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(...)): tried to allocate 20.00 MiB, 4.44 MiB free, 21.77 GiB allocated by PyTorch
2025-05-07T20:32:19.2108903Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 20.00 MiB, 4.44 MiB free, 21.77 GiB allocated by PyTorch
2025-05-07T20:32:19.2113949Z =============================== warnings summary ===============================
2025-05-07T20:32:19.2114368Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:19.2114699Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:19.2115046Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:19.2115904Z   /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
2025-05-07T20:32:19.2116130Z     warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
2025-05-07T20:32:19.2116138Z 
2025-05-07T20:32:19.2116343Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
2025-05-07T20:32:19.2116507Z ================= 1 failed, 1 deselected, 3 warnings in 13.83s =================
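Note: the DeprecationWarning above fires three times from triton/runtime/autotuner.py:108; per the linked Triton PR, the warmup/rep/use_cuda_graph knobs are deprecated in favor of the do_bench interface. A hedged sketch of timing a CUDA workload directly with triton.testing.do_bench, using the same quantiles the autotuner passes in the retried run's traceback below; the workload is a stand-in, not one of the kernels in this log:

    import torch
    import triton.testing

    def workload() -> torch.Tensor:
        # Stand-in for a kernel launch; any CUDA callable can be timed this way.
        a = torch.randn(1024, 1024, device="cuda")
        return a @ a

    # Median, 20th- and 80th-percentile latencies in ms, matching
    # quantiles=(0.5, 0.2, 0.8) as used by the autotuner.
    med, p20, p80 = triton.testing.do_bench(workload, quantiles=(0.5, 0.2, 0.8))
    print(f"median {med:.3f} ms (p20 {p20:.3f}, p80 {p80:.3f})")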
See " 2025-05-07T20:32:19.2116138Z 2025-05-07T20:32:19.2116343Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:19.2116507Z ================= 1 failed, 1 deselected, 3 warnings in 13.83s ================= 2025-05-07T20:32:20.8856322Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:20.9487952Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:20.9488184Z 2025-05-07T20:32:22.9508957Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:25.1300737Z ============================= test session starts ============================== 2025-05-07T20:32:25.1301961Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:25.1302791Z cachedir: .pytest_cache 2025-05-07T20:32:25.1303378Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:25.1304358Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:25.1304760Z plugins: hypothesis-6.131.14 2025-05-07T20:32:26.6740301Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:26.7699404Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:26.7699812Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:26.7700038Z 2025-05-07T20:32:28.8743779Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.8744481Z self=, 2025-05-07T20:32:28.8744914Z T=1, 2025-05-07T20:32:28.8745113Z D=5120, 2025-05-07T20:32:28.8745305Z scale_ub=None, 2025-05-07T20:32:28.8745526Z contiguous=True, 2025-05-07T20:32:28.8745755Z compiled=True, 2025-05-07T20:32:28.8745958Z ) 2025-05-07T20:32:28.8746286Z self = 2025-05-07T20:32:28.8746784Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:28.8747037Z 2025-05-07T20:32:28.8747127Z @given( 2025-05-07T20:32:28.8747360Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.8747677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.8747981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.8748303Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.8748629Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.8748920Z ) 2025-05-07T20:32:28.8749261Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.8750008Z def test_silu_mul_quant( 2025-05-07T20:32:28.8750254Z self, 2025-05-07T20:32:28.8750453Z T: int, 2025-05-07T20:32:28.8750648Z D: int, 2025-05-07T20:32:28.8750868Z scale_ub: Optional[float], 2025-05-07T20:32:28.8751143Z contiguous: bool, 2025-05-07T20:32:28.8751382Z compiled: bool, 2025-05-07T20:32:28.8751613Z ) -> None: 2025-05-07T20:32:28.8751835Z torch.manual_seed(2025) 2025-05-07T20:32:28.8752183Z 2025-05-07T20:32:28.8752459Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.8752799Z 2025-05-07T20:32:28.8752987Z x_sign = torch.sign(x) 2025-05-07T20:32:28.8753329Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:28.8753642Z x = x_sign * x_clamp 2025-05-07T20:32:28.8753880Z x0 = x[:, :D] 2025-05-07T20:32:28.8754102Z x1 = x[:, D:] 2025-05-07T20:32:28.8754322Z 2025-05-07T20:32:28.8754508Z if contiguous: 2025-05-07T20:32:28.8754741Z x0 = x0.contiguous() 2025-05-07T20:32:28.8755000Z x1 = x1.contiguous() 2025-05-07T20:32:28.8755235Z 2025-05-07T20:32:28.8755433Z if scale_ub is not None: 2025-05-07T20:32:28.8755797Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.8756138Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.8756445Z ) 2025-05-07T20:32:28.8756646Z else: 2025-05-07T20:32:28.8756862Z scale_ub_tensor = None 2025-05-07T20:32:28.8757115Z 2025-05-07T20:32:28.8757347Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8757663Z op = silu_mul_quant 2025-05-07T20:32:28.8757908Z if compiled: 2025-05-07T20:32:28.8758159Z op = torch.compile(op) 2025-05-07T20:32:28.8758459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8758732Z 2025-05-07T20:32:28.8758930Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.8759215Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.8759499Z 2025-05-07T20:32:28.8759737Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8760079Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.8760383Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.8760787Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.8761148Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.8761454Z 2025-05-07T20:32:28.8761663Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:28.8761854Z 2025-05-07T20:32:28.8761965Z moe/activation_test.py:126: 2025-05-07T20:32:28.8762257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8762596Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.8762922Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.8763710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.8764446Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.8764997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.8765675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.8766352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.8767067Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.8767787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.8768417Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.8769057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.8769569Z fn() 2025-05-07T20:32:28.8770070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.8770645Z self.fn.run( 2025-05-07T20:32:28.8771103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.8771624Z kernel = self.compile( 2025-05-07T20:32:28.8772204Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.8772846Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.8773289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8773522Z 2025-05-07T20:32:28.8773918Z self = 2025-05-07T20:32:28.8774991Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.8776396Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78c52aa700>} 2025-05-07T20:32:28.8777723Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.8778731Z context = 2025-05-07T20:32:28.8779013Z 2025-05-07T20:32:28.8779182Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.8779698Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.8780158Z module_map=module_map) 2025-05-07T20:32:28.8780523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.8780877Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.8781142Z E ^ 2025-05-07T20:32:28.8781648Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.8782087Z 2025-05-07T20:32:28.8782504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.8783013Z 2025-05-07T20:32:28.8783124Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.8783526Z self=, 2025-05-07T20:32:28.8783927Z T=2048, 2025-05-07T20:32:28.8784121Z D=5120, 2025-05-07T20:32:28.8784309Z scale_ub=1200.0, 2025-05-07T20:32:28.8784537Z contiguous=True, 2025-05-07T20:32:28.8784758Z compiled=False, 2025-05-07T20:32:28.8784956Z ) 2025-05-07T20:32:28.8785277Z self = 2025-05-07T20:32:28.8785772Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:28.8786043Z 2025-05-07T20:32:28.8786127Z @given( 2025-05-07T20:32:28.8786361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.8786674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.8786980Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.8787301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.8787630Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.8787915Z ) 2025-05-07T20:32:28.8788257Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.8788698Z def test_silu_mul_quant( 2025-05-07T20:32:28.8788943Z self, 2025-05-07T20:32:28.8789184Z T: int, 2025-05-07T20:32:28.8789383Z D: int, 2025-05-07T20:32:28.8789601Z scale_ub: Optional[float], 2025-05-07T20:32:28.8789864Z contiguous: bool, 2025-05-07T20:32:28.8790105Z compiled: bool, 2025-05-07T20:32:28.8790324Z ) -> None: 2025-05-07T20:32:28.8790532Z torch.manual_seed(2025) 2025-05-07T20:32:28.8790776Z 2025-05-07T20:32:28.8791047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.8791388Z 2025-05-07T20:32:28.8791627Z x_sign = torch.sign(x) 2025-05-07T20:32:28.8791913Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.8792219Z x = x_sign * x_clamp 2025-05-07T20:32:28.8792455Z x0 = x[:, :D] 
2025-05-07T20:32:28.8792672Z x1 = x[:, D:] 2025-05-07T20:32:28.8792880Z 2025-05-07T20:32:28.8793061Z if contiguous: 2025-05-07T20:32:28.8793293Z x0 = x0.contiguous() 2025-05-07T20:32:28.8793551Z x1 = x1.contiguous() 2025-05-07T20:32:28.8793787Z 2025-05-07T20:32:28.8793979Z if scale_ub is not None: 2025-05-07T20:32:28.8794255Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.8794585Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.8794946Z ) 2025-05-07T20:32:28.8795141Z else: 2025-05-07T20:32:28.8795349Z scale_ub_tensor = None 2025-05-07T20:32:28.8795605Z 2025-05-07T20:32:28.8795837Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8796157Z op = silu_mul_quant 2025-05-07T20:32:28.8796400Z if compiled: 2025-05-07T20:32:28.8796645Z op = torch.compile(op) 2025-05-07T20:32:28.8796942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8797212Z 2025-05-07T20:32:28.8797407Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.8797569Z 2025-05-07T20:32:28.8797674Z moe/activation_test.py:117: 2025-05-07T20:32:28.8797963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8798666Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.8798951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8799633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.8800443Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.8800978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.8801658Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.8802309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.8802840Z kernel = self.compile( 2025-05-07T20:32:28.8803431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.8804079Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.8804471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8804706Z 2025-05-07T20:32:28.8804917Z self = 2025-05-07T20:32:28.8805987Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.8807342Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78c5162020>} 2025-05-07T20:32:28.8808659Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.8809741Z context = 2025-05-07T20:32:28.8810028Z 2025-05-07T20:32:28.8810192Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.8810714Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.8811171Z module_map=module_map) 2025-05-07T20:32:28.8811533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.8811957Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.8812212Z E ^ 2025-05-07T20:32:28.8812666Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.8813108Z 2025-05-07T20:32:28.8813514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[Every remaining Hypothesis example in this run fails with this same CompilationError; each repeat is condensed below to its parameters and the kernel that failed to compile.]

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    [same test source as above; ref_fn() at moe/activation_test.py:126 fails in triton_quantize_fp8_row compiling _kernel_quantize_fp8_row: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
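All of these compilation failures share one root cause: Triton's NVIDIA backend appears to accept the fp8e4nv (float8_e4m3fn) dtype only on GPUs with compute capability 8.9 or newer, and the A10G in a linux.g5.4xlarge runner is SM 8.6, where only 'fp8e4b15' and 'fp8e5' are offered. A minimal standalone sketch of a preflight check (the helper name is illustrative, not an FBGEMM API):

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv needs SM 8.9+ (Ada/Hopper) in Triton's NVIDIA backend;
        # the A10G on this runner reports capability (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    if __name__ == "__main__":
        print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))
        print("fp8e4nv supported:", fp8e4nv_supported())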
Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    [same test source; fn() at moe/activation_test.py:117 fails in silu_mul_quant compiling _fbgemm_silu_mul_quant with the identical CompilationError]
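If this job is expected to pass on pre-Ada runners, a capability-gated skip would turn these hard CompilationErrors into clean skips instead of retried failures. A hedged sketch (the helper and decorator placement are illustrative; FBGEMM may already ship its own gating utilities):

    import unittest
    import torch

    def _has_triton_fp8() -> bool:
        # SM 8.9 is the first architecture where Triton compiles fp8e4nv.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipIf(
            not _has_triton_fp8(),
            "Triton fp8e4nv requires SM 8.9+; this GPU only offers fp8e4b15/fp8e5",
        )
        def test_silu_mul_quant(self) -> None:
            ...  # body unchanged from the source shown above

With the guard outermost, the skip would fire before any example is generated, so the retry wrapper would see a skip rather than a failure.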
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
    [same test source; ref_fn() at moe/activation_test.py:126 fails compiling _kernel_quantize_fp8_row with the identical CompilationError]

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    [same test source; fn() at moe/activation_test.py:117 fails compiling _fbgemm_silu_mul_quant with the identical CompilationError]

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    [same test source; fn() at moe/activation_test.py:117 fails compiling _fbgemm_silu_mul_quant with the identical CompilationError]
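For debugging on hardware where the Triton kernel cannot compile at all, the rowwise quantization that ref_fn() delegates to triton_quantize_fp8_row can be approximated in eager PyTorch, which converts to float8_e4m3fn in software. This is a sketch under assumptions, not fbgemm_gpu's implementation: it assumes y_scale is the dequantization scale (the test computes y_fp8.to(torch.float32) * y_scale[:, None]) and that scale_ub caps the per-row max before the scale is derived.

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_eager(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # assumed clamp direction
        row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
        y_scale = row_max / fp8_max  # dequant scale, as consumed by the test
        y_fp8 = (y.to(torch.float32) / y_scale).clamp(-fp8_max, fp8_max)
        return y_fp8.to(torch.float8_e4m3fn), y_scale.squeeze(-1)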
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.9512716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.9513487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.9514151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.9514681Z kernel = self.compile( 2025-05-07T20:32:30.9515221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.9515875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.9516276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9516503Z 2025-05-07T20:32:30.9516712Z self = 2025-05-07T20:32:30.9517782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.9519191Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78c411a840>} 2025-05-07T20:32:30.9520531Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.9521542Z context = 2025-05-07T20:32:30.9521827Z 2025-05-07T20:32:30.9521993Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.9522513Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.9522983Z module_map=module_map) 2025-05-07T20:32:30.9523346Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.9531283Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.9531545Z E ^ 2025-05-07T20:32:30.9532023Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.9532463Z 2025-05-07T20:32:30.9532871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.9533376Z 2025-05-07T20:32:30.9533480Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.9533984Z self=, 2025-05-07T20:32:30.9534382Z T=128, 2025-05-07T20:32:30.9534562Z D=7168, 2025-05-07T20:32:30.9534841Z scale_ub=None, 2025-05-07T20:32:30.9535066Z contiguous=False, 2025-05-07T20:32:30.9535285Z compiled=True, 2025-05-07T20:32:30.9535491Z ) 2025-05-07T20:32:30.9535810Z self = 2025-05-07T20:32:30.9536291Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:30.9536560Z 2025-05-07T20:32:30.9536642Z @given( 2025-05-07T20:32:30.9536877Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.9537192Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.9537549Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.9537879Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.9538209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.9538490Z ) 2025-05-07T20:32:30.9538842Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.9539284Z def test_silu_mul_quant( 2025-05-07T20:32:30.9539527Z self, 2025-05-07T20:32:30.9539726Z T: int, 2025-05-07T20:32:30.9539920Z D: int, 2025-05-07T20:32:30.9540132Z scale_ub: Optional[float], 2025-05-07T20:32:30.9540405Z contiguous: bool, 2025-05-07T20:32:30.9540646Z compiled: bool, 2025-05-07T20:32:30.9540913Z ) -> None: 2025-05-07T20:32:30.9541132Z torch.manual_seed(2025) 2025-05-07T20:32:30.9541379Z 2025-05-07T20:32:30.9541643Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.9541991Z 2025-05-07T20:32:30.9542187Z x_sign = torch.sign(x) 2025-05-07T20:32:30.9542474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.9542773Z x = x_sign * x_clamp 2025-05-07T20:32:30.9543011Z x0 = x[:, :D] 2025-05-07T20:32:30.9543227Z x1 = x[:, D:] 2025-05-07T20:32:30.9543426Z 2025-05-07T20:32:30.9543612Z if contiguous: 2025-05-07T20:32:30.9543843Z x0 = x0.contiguous() 2025-05-07T20:32:30.9544092Z x1 = x1.contiguous() 2025-05-07T20:32:30.9544330Z 2025-05-07T20:32:30.9544519Z if scale_ub is not None: 2025-05-07T20:32:30.9544787Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.9545120Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.9545414Z ) 2025-05-07T20:32:30.9545655Z else: 2025-05-07T20:32:30.9545866Z scale_ub_tensor = None 2025-05-07T20:32:30.9546112Z 2025-05-07T20:32:30.9546346Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.9546658Z op = silu_mul_quant 2025-05-07T20:32:30.9546901Z if compiled: 2025-05-07T20:32:30.9547146Z op = torch.compile(op) 2025-05-07T20:32:30.9547439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9547707Z 2025-05-07T20:32:30.9547897Z y_fp8, y_scale = fn() 2025-05-07T20:32:30.9548179Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:30.9548465Z 2025-05-07T20:32:30.9548699Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.9549029Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:30.9549316Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:30.9549622Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:30.9549977Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.9550282Z 2025-05-07T20:32:30.9550473Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:30.9550672Z 2025-05-07T20:32:30.9550769Z moe/activation_test.py:126: 2025-05-07T20:32:30.9551059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9551384Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:30.9551700Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.9552470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:30.9553262Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:30.9553845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.9554518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.9555189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:30.9555943Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.9556648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:30.9557274Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:30.9557866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:30.9558370Z fn() 2025-05-07T20:32:30.9558867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:30.9559436Z self.fn.run( 2025-05-07T20:32:30.9559938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.9560452Z kernel = self.compile( 2025-05-07T20:32:30.9560985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.9561625Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.9562012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9562243Z 2025-05-07T20:32:30.9562448Z self = 2025-05-07T20:32:30.9563512Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.9564911Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf7bf060>} 2025-05-07T20:32:30.9566282Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.9567282Z context = 2025-05-07T20:32:30.9567571Z 2025-05-07T20:32:30.9567735Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.9568253Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.9568715Z module_map=module_map) 2025-05-07T20:32:30.9569074Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.9569429Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:30.9569693Z E ^ 2025-05-07T20:32:30.9570140Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.9570588Z 2025-05-07T20:32:30.9570997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.1958701Z 2025-05-07T20:32:31.1958944Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.1959389Z self=, 2025-05-07T20:32:31.1959818Z T=128, 2025-05-07T20:32:31.1960010Z D=7168, 2025-05-07T20:32:31.1960244Z scale_ub=None, 2025-05-07T20:32:31.1960458Z contiguous=False, 2025-05-07T20:32:31.1960688Z compiled=False, 2025-05-07T20:32:31.1961161Z ) 2025-05-07T20:32:31.1961490Z self = 2025-05-07T20:32:31.1961983Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:31.1962246Z 2025-05-07T20:32:31.1962326Z @given( 2025-05-07T20:32:31.1962559Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.1962877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.1963172Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.1963498Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.1963942Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.1964239Z ) 2025-05-07T20:32:31.1964586Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.1965024Z def test_silu_mul_quant( 2025-05-07T20:32:31.1965265Z self, 2025-05-07T20:32:31.1965450Z T: int, 2025-05-07T20:32:31.1965640Z D: int, 2025-05-07T20:32:31.1965855Z scale_ub: Optional[float], 2025-05-07T20:32:31.1966115Z contiguous: bool, 2025-05-07T20:32:31.1966350Z compiled: bool, 2025-05-07T20:32:31.1966570Z ) -> None: 2025-05-07T20:32:31.1966776Z torch.manual_seed(2025) 2025-05-07T20:32:31.1967055Z 2025-05-07T20:32:31.1967405Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.1967748Z 2025-05-07T20:32:31.1967939Z x_sign = torch.sign(x) 2025-05-07T20:32:31.1968225Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.1968534Z x = x_sign * x_clamp 2025-05-07T20:32:31.1968771Z x0 = x[:, :D] 2025-05-07T20:32:31.1968978Z x1 = x[:, D:] 2025-05-07T20:32:31.1969186Z 2025-05-07T20:32:31.1969374Z if contiguous: 2025-05-07T20:32:31.1969598Z x0 = x0.contiguous() 2025-05-07T20:32:31.1969854Z x1 = x1.contiguous() 2025-05-07T20:32:31.1970096Z 2025-05-07T20:32:31.1970280Z if scale_ub is not None: 2025-05-07T20:32:31.1970555Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.1970886Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.1971189Z ) 2025-05-07T20:32:31.1971382Z else: 2025-05-07T20:32:31.1971594Z scale_ub_tensor = None 2025-05-07T20:32:31.1971838Z 2025-05-07T20:32:31.1972150Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.1972464Z op = silu_mul_quant 2025-05-07T20:32:31.1972715Z if compiled: 2025-05-07T20:32:31.1972957Z op = torch.compile(op) 2025-05-07T20:32:31.1973249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.1973519Z 2025-05-07T20:32:31.1973790Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.1973957Z 2025-05-07T20:32:31.1974054Z moe/activation_test.py:117: 2025-05-07T20:32:31.1974350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.1974676Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.1974954Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.1975636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.1976322Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.1976851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.1977525Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.1978181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.1978701Z kernel = self.compile( 2025-05-07T20:32:31.1979240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.1979887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.1980332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.1980558Z 2025-05-07T20:32:31.1980761Z self = 2025-05-07T20:32:31.1981832Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.1983222Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf909e40>} 2025-05-07T20:32:31.1984543Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.1985544Z context = 2025-05-07T20:32:31.1985826Z 2025-05-07T20:32:31.1985989Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.1986507Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.1987010Z module_map=module_map) 2025-05-07T20:32:31.1987373Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.1987722Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.1987983Z E ^ 2025-05-07T20:32:31.1988437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.1988874Z 2025-05-07T20:32:31.1989279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.1989788Z 2025-05-07T20:32:31.1989892Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.1990303Z self=, 2025-05-07T20:32:31.1990702Z T=4096, 2025-05-07T20:32:31.1990897Z D=5120, 2025-05-07T20:32:31.1991084Z scale_ub=1200.0, 2025-05-07T20:32:31.1991305Z contiguous=True, 2025-05-07T20:32:31.1991527Z compiled=False, 2025-05-07T20:32:31.1991731Z ) 2025-05-07T20:32:31.1992090Z self = 2025-05-07T20:32:31.1992580Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:31.1992853Z 2025-05-07T20:32:31.1992938Z @given( 2025-05-07T20:32:31.1993161Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.1993472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.1993812Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.1994146Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.1994471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.1994759Z ) 2025-05-07T20:32:31.1995098Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.1995536Z def test_silu_mul_quant( 2025-05-07T20:32:31.1995778Z self, 2025-05-07T20:32:31.1995963Z T: int, 2025-05-07T20:32:31.1996160Z D: int, 2025-05-07T20:32:31.1996378Z scale_ub: Optional[float], 2025-05-07T20:32:31.1996641Z contiguous: bool, 2025-05-07T20:32:31.1996875Z compiled: bool, 2025-05-07T20:32:31.1997097Z ) -> None: 2025-05-07T20:32:31.1997307Z torch.manual_seed(2025) 2025-05-07T20:32:31.1997537Z 2025-05-07T20:32:31.1997801Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.1998137Z 2025-05-07T20:32:31.1998488Z x_sign = torch.sign(x) 2025-05-07T20:32:31.1998782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.1999093Z x = x_sign * x_clamp 2025-05-07T20:32:31.1999325Z x0 = x[:, :D] 2025-05-07T20:32:31.1999615Z x1 = x[:, D:] 2025-05-07T20:32:31.1999825Z 2025-05-07T20:32:31.2000007Z if contiguous: 2025-05-07T20:32:31.2000239Z x0 = x0.contiguous() 2025-05-07T20:32:31.2000497Z x1 = x1.contiguous() 2025-05-07T20:32:31.2000730Z 2025-05-07T20:32:31.2000926Z if scale_ub is not None: 2025-05-07T20:32:31.2001202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.2001529Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.2001904Z ) 2025-05-07T20:32:31.2002102Z else: 2025-05-07T20:32:31.2002313Z scale_ub_tensor = None 2025-05-07T20:32:31.2002561Z 2025-05-07T20:32:31.2002795Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.2003110Z op = silu_mul_quant 2025-05-07T20:32:31.2003356Z if compiled: 2025-05-07T20:32:31.2003609Z op = torch.compile(op) 2025-05-07T20:32:31.2003954Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.2004233Z 2025-05-07T20:32:31.2004426Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.2004589Z 2025-05-07T20:32:31.2004696Z moe/activation_test.py:117: 2025-05-07T20:32:31.2005051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.2005384Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.2005666Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.2006350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.2007033Z 
2025-05-07T20:32:31.2007566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:31.2008242Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:31.2008893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:31.2009427Z     kernel = self.compile(
2025-05-07T20:32:31.2009967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:31.2010619Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:31.2011072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:31.2011309Z 
2025-05-07T20:32:31.2011514Z self = 
2025-05-07T20:32:31.2012588Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:31.2014001Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf90a5c0>}
2025-05-07T20:32:31.2015324Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:31.2016335Z context = 
2025-05-07T20:32:31.2016624Z 
2025-05-07T20:32:31.2016790Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:31.2017306Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:31.2017763Z                            module_map=module_map)
2025-05-07T20:32:31.2018126Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:31.2018479Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:31.2018740Z E       ^
2025-05-07T20:32:31.2019191Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:31.2019686Z 
2025-05-07T20:32:31.2020095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
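[annotation] Every failure in this run has the same root cause: Triton's fp8e4nv corresponds to torch.float8_e4m3fn, which NVIDIA hardware supports natively only from compute capability 8.9 (Ada/Hopper) onward, while this runner's g5.4xlarge carries an A10G at SM 8.6, so the kernel is rejected at compile time. A minimal guard sketch follows; the helper name has_fp8e4nv_support is illustrative, and the class name ActivationTest is a stand-in inferred from the file name moe/activation_test.py, not FBGEMM's actual skip logic.

import unittest

import torch

def has_fp8e4nv_support() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) requires an NVIDIA GPU with compute
    # capability >= 8.9 (Ada / Hopper); the A10G on this runner is 8.6.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not has_fp8e4nv_support(), "fp8e4nv requires SM 8.9+ (e.g. L4, H100)")
class ActivationTest(unittest.TestCase):  # class name assumed from the log
    ...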
2025-05-07T20:32:31.2020599Z 
2025-05-07T20:32:31.2020711Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:31.2021124Z     self=,
2025-05-07T20:32:31.2021521Z     T=1,
2025-05-07T20:32:31.2021706Z     D=5120,
2025-05-07T20:32:31.2021900Z     scale_ub=None,
2025-05-07T20:32:31.2022153Z     contiguous=True,
2025-05-07T20:32:31.2022377Z     compiled=True,
2025-05-07T20:32:31.2022578Z )
2025-05-07T20:32:31.2022889Z self = 
2025-05-07T20:32:31.2023364Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
[test body elided -- identical to the listing above through the definition of fn()]
2025-05-07T20:32:31.2035061Z         y_fp8, y_scale = fn()
2025-05-07T20:32:31.2035345Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:31.2035639Z 
2025-05-07T20:32:31.2035873Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:31.2036214Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:31.2036504Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:31.2036814Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:31.2037176Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:31.2037541Z 
2025-05-07T20:32:31.2037741Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:31.2037939Z 
2025-05-07T20:32:31.2038040Z moe/activation_test.py:126: 
2025-05-07T20:32:31.2038336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:31.2038673Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:31.2038999Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:31.2039778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:31.2040568Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:31.2041107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:31.2041782Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:31.2042465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:31.2043182Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:31.2043986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:31.2044620Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:31.2045221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:31.2045740Z     fn()
2025-05-07T20:32:31.2046238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:31.2046818Z     self.fn.run(
2025-05-07T20:32:31.2047288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:31.2047808Z     kernel = self.compile(
2025-05-07T20:32:31.2048346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:31.2048993Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:31.2049393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:31.2049623Z 
2025-05-07T20:32:31.2049890Z self = 
2025-05-07T20:32:31.2050957Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:31.2052308Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf90b240>}
2025-05-07T20:32:31.2053634Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:31.2054726Z context = 
2025-05-07T20:32:31.2055017Z 
2025-05-07T20:32:31.2055187Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:31.2055706Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:31.2056171Z                            module_map=module_map)
2025-05-07T20:32:31.2056531Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:31.2056889Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:31.2057161Z E       ^
2025-05-07T20:32:31.2057610Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:31.2058056Z 
2025-05-07T20:32:31.2058464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
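[annotation] The reference path above spells out the op's semantics: SiLU(x0) * x1 in fp32, followed by row-wise fp8 quantization. Below is a compact eager-mode sketch of that contract, matching how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]). It assumes scale_ub acts as a cap on the per-row maximum; the function name is illustrative and this is not FBGEMM's actual triton_quantize_fp8_row implementation.

from typing import Optional, Tuple

import torch

def silu_mul_quant_reference(
    x0: torch.Tensor,                          # [T, D] bf16
    x1: torch.Tensor,                          # [T, D] bf16
    scale_ub: Optional[torch.Tensor] = None,   # optional [1] fp32 cap on row max
) -> Tuple[torch.Tensor, torch.Tensor]:
    x0_fp32 = x0.to(torch.float32)
    y = x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)  # SiLU(x0) * x1

    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # assumed: cap outlier rows
    row_max = torch.clamp(row_max, min=1e-12)       # guard all-zero rows
    y_scale = row_max / fp8_max                     # per-row dequantization scale
    y_fp8 = torch.clamp(y / y_scale[:, None], -fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, y_scale  # y is recovered as y_fp8.float() * y_scale[:, None]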
2025-05-07T20:32:31.9062967Z 
2025-05-07T20:32:31.9063637Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:31.9110886Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.6993427Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.7033060Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
[each of these four examples fails exactly like the T=1 example above: ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:32.7275691Z W0507 20:32:32.726000 276483 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:32.7276916Z W0507 20:32:32.726000 276483 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:32.7278223Z W0507 20:32:32.726000 276483 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:32.7279312Z W0507 20:32:32.726000 276483 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:32.7280406Z W0507 20:32:32.726000 276483 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
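[annotation] The warning above is a secondary issue: every hypothesis example changes T, and the contiguous flag flips x0 between a contiguous tensor (row stride 5120) and a sliced view (row stride 10240), so each compiled call guards on a new shape/stride combination until the recompile budget of 8 is exhausted and dynamo falls back. A sketch of the usual mitigations, assuming this build's knob name (the warning itself prints config.recompile_limit; older releases call it cache_size_limit) and taking the import path from the traceback:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Raise the per-function recompile budget (knob name is version-dependent:
# recompile_limit here, cache_size_limit in older PyTorch releases).
torch._dynamo.config.recompile_limit = 64

# Compile once with dynamic shapes so a new T does not bake a new graph.
compiled_op = torch.compile(silu_mul_quant, dynamic=True)

def run(x0: torch.Tensor, x1: torch.Tensor, scale_ub: torch.Tensor | None):
    # Alternatively, mark only the token dimension dynamic; note the
    # contiguous-vs-sliced stride guard still costs one recompile per layout.
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)
    return compiled_op(x0, x1, scale_ub)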
2025-05-07T20:32:33.1306690Z 
2025-05-07T20:32:33.1306873Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:33.1309854Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[test body elided -- identical to the listing above]
2025-05-07T20:32:33.1321489Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:33.1321748Z moe/activation_test.py:117: 
2025-05-07T20:32:33.1322038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:33.1322367Z moe/activation_test.py:115: in fn
2025-05-07T20:32:33.1322638Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:33.1323256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:33.1323809Z     return fn(*args, **kwargs)
2025-05-07T20:32:33.1324464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:33.1325139Z     _fbgemm_silu_mul_quant[grid](
[remaining triton compile frames elided -- identical to the first traceback above]
2025-05-07T20:32:33.1336523Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:33.1336874Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:33.1337135Z E       ^
2025-05-07T20:32:33.1337586Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:33.1338439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:33.1339050Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
[fails exactly like the earlier T=1 example: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
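[annotation] Once a fix or a supported GPU is in place, any one of these failing parameter sets can be pinned so it always runs first; hypothesis replays pinned examples deterministically on every invocation. A sketch reusing the strategies from the listing above, shown as a free function (in the real test class the method keeps its self parameter, verbosity, and max_examples settings):

from typing import Optional

from hypothesis import example, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)  # pinned repro
@settings(deadline=None)
def test_silu_mul_quant(
    T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
) -> None:
    ...  # body as in the listing above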
2025-05-07T20:32:33.2772799Z 
2025-05-07T20:32:33.2773146Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[fails exactly like the first example above: fn() at moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant, CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:33.2804942Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[fails the same way through the torch.compile wrapper (torch/_dynamo/eval_frame.py:678: in _fn), ending in the same CompilationError in _fbgemm_silu_mul_quant]
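[annotation] The error message itself lists fp8e5 (e5m2) as a supported fp8 dtype on this architecture, alongside fp8e4b15. A kernel or test that must also run on pre-Ada GPUs could therefore fall back to torch.float8_e5m2 at the cost of mantissa precision; a hedged sketch of such a selection (the function name is illustrative, and FBGEMM's real kernels target e4m3):

import torch

def pick_fp8_dtype() -> torch.dtype:
    # e4m3 (fp8e4nv) needs SM 8.9+; per the Triton error above, e5m2 (fp8e5)
    # is accepted on this SM 8.6 device.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2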
2025-05-07T20:32:33.2821984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.2822655Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.2823177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.2823894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.2824554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.2825082Z kernel = self.compile( 2025-05-07T20:32:33.2825610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.2826246Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.2826636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2826858Z 2025-05-07T20:32:33.2827066Z self = 2025-05-07T20:32:33.2828121Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.2829497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dd03ce0>} 2025-05-07T20:32:33.2830812Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.2831808Z context = 2025-05-07T20:32:33.2832084Z 2025-05-07T20:32:33.2832252Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.2832750Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.2833212Z module_map=module_map) 2025-05-07T20:32:33.2833567Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.2833910Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.2834159Z E ^ 2025-05-07T20:32:33.2834609Z E ValueError("type fp8e4nv not supported in this architecture. 
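This ValueError is the whole failure mode of the job: Triton rejects the fp8e4nv (float8 e4m3) element type at kernel-compile time because the g5 runner's NVIDIA A10G reports compute capability 8.6, while Triton's fp8e4nv support starts at compute capability 8.9 (Ada/Hopper). A minimal sketch of a capability gate that would skip these cases on such runners; the helper name is illustrative, not from FBGEMM, and it assumes CUDA device 0 is the test device:

    import unittest

    import torch

    def device_supports_fp8e4nv() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn and needs SM 8.9+;
        # an A10G (SM 8.6) takes the failing path seen in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability(0) >= (8, 9)

    # Usage sketch: gate the fp8 tests rather than letting every Hypothesis
    # example fail with the same CompilationError.
    # @unittest.skipUnless(device_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")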
2025-05-07T20:32:33.2836065Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    -> same CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported
2025-05-07T20:32:33.4436347Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> same CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported
2025-05-07T20:32:33.4467689Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> same CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported
2025-05-07T20:32:33.6038709Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    -> same CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported
2025-05-07T20:32:33.6076010Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported
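The "Trying example" lines above come from Hypothesis: the test runs with verbosity=Verbosity.verbose, so every drawn parameter combination is logged before it fails. To replay one combination from this log deterministically, Hypothesis's @example decorator can pin it; a standalone sketch, not the FBGEMM test file, with the strategy lists shortened:

    from hypothesis import Verbosity, example, given, settings
    from hypothesis import strategies as st

    @given(T=st.sampled_from([1, 128]), D=st.sampled_from([5120, 7168]))
    @example(T=1, D=7168)  # pinned case from the log; explicit examples run first
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_sketch(T: int, D: int) -> None:
        assert T * D > 0  # placeholder body; the real test calls silu_mul_quant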
2025-05-07T20:32:33.8153222Z Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

This example got further: fn() returned, and the same error was raised from the reference path instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
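For reference, the test dequantizes with y = y_fp8.to(torch.float32) * y_scale[:, None], i.e. one scale per row. A hedged sketch of row-wise fp8 quantization consistent with that dequant step, on one plausible reading of scale_ub as a cap on the per-row max; names are illustrative and this is not FBGEMM's _kernel_quantize_fp8_row:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude, optionally capped, sets the dequant scale.
        row_max = y.abs().amax(dim=1).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale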
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.8193274Z 2025-05-07T20:32:33.8193680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.8194192Z 2025-05-07T20:32:33.8194295Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.8194703Z self=, 2025-05-07T20:32:33.8195098Z T=1, 2025-05-07T20:32:33.8195275Z D=5120, 2025-05-07T20:32:33.8195468Z scale_ub=1200.0, 2025-05-07T20:32:33.8195690Z contiguous=False, 2025-05-07T20:32:33.8195950Z compiled=True, 2025-05-07T20:32:33.8196156Z ) 2025-05-07T20:32:33.8196470Z self = 2025-05-07T20:32:33.8196946Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:33.8197213Z 2025-05-07T20:32:33.8197295Z @given( 2025-05-07T20:32:33.8197526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.8197839Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.8198141Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.8198801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.8199125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.8199405Z ) 2025-05-07T20:32:33.8199755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.8200189Z def test_silu_mul_quant( 2025-05-07T20:32:33.8200422Z self, 2025-05-07T20:32:33.8200616Z T: int, 2025-05-07T20:32:33.8200814Z D: int, 2025-05-07T20:32:33.8201024Z scale_ub: Optional[float], 2025-05-07T20:32:33.8201290Z contiguous: bool, 2025-05-07T20:32:33.8201526Z compiled: bool, 2025-05-07T20:32:33.8201740Z ) -> None: 2025-05-07T20:32:33.8201951Z torch.manual_seed(2025) 2025-05-07T20:32:33.8202265Z 2025-05-07T20:32:33.8202530Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.8202870Z 2025-05-07T20:32:33.8203061Z x_sign = torch.sign(x) 2025-05-07T20:32:33.8203353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.8203656Z x = x_sign * x_clamp 2025-05-07T20:32:33.8203891Z x0 = x[:, :D] 2025-05-07T20:32:33.8204104Z x1 = x[:, D:] 2025-05-07T20:32:33.8204305Z 2025-05-07T20:32:33.8204489Z if contiguous: 2025-05-07T20:32:33.8204718Z x0 = x0.contiguous() 2025-05-07T20:32:33.8204969Z x1 = x1.contiguous() 2025-05-07T20:32:33.8205210Z 2025-05-07T20:32:33.8205403Z if scale_ub is not None: 2025-05-07T20:32:33.8205665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.8205999Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.8206305Z ) 2025-05-07T20:32:33.8206492Z else: 2025-05-07T20:32:33.8206711Z scale_ub_tensor = None 2025-05-07T20:32:33.8207026Z 2025-05-07T20:32:33.8207252Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.8207566Z op = silu_mul_quant 2025-05-07T20:32:33.8207835Z if compiled: 2025-05-07T20:32:33.8208082Z op = torch.compile(op) 2025-05-07T20:32:33.8208377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.8208644Z 2025-05-07T20:32:33.8208838Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.8209010Z 2025-05-07T20:32:33.8209109Z moe/activation_test.py:117: 2025-05-07T20:32:33.8209401Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.8209725Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.8210009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.8210560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.8211116Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.8211769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.8212452Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.8212985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.8213646Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.8214420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.8214946Z kernel = self.compile( 2025-05-07T20:32:33.8215550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.8216197Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.8216595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.8216817Z 2025-05-07T20:32:33.8217036Z self = 2025-05-07T20:32:33.8218085Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.8220215Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dc29300>} 2025-05-07T20:32:33.8221536Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.8222538Z context = 2025-05-07T20:32:33.8222896Z 2025-05-07T20:32:33.8223069Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.8223574Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.8224041Z module_map=module_map) 2025-05-07T20:32:33.8224400Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.8224741Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.8225006Z E ^ 2025-05-07T20:32:33.8225458Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.8225894Z 2025-05-07T20:32:33.8226307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.9634765Z 2025-05-07T20:32:33.9635389Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.9636600Z self=, 2025-05-07T20:32:33.9637683Z T=1, 2025-05-07T20:32:33.9638562Z D=5120, 2025-05-07T20:32:33.9638987Z scale_ub=1200.0, 2025-05-07T20:32:33.9639441Z contiguous=False, 2025-05-07T20:32:33.9639901Z compiled=False, 2025-05-07T20:32:33.9640331Z ) 2025-05-07T20:32:33.9640961Z self = 2025-05-07T20:32:33.9641942Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:33.9642473Z 2025-05-07T20:32:33.9642645Z @given( 2025-05-07T20:32:33.9643112Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.9643747Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.9644369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.9644794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.9645148Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.9645441Z ) 2025-05-07T20:32:33.9645800Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.9646248Z def test_silu_mul_quant( 2025-05-07T20:32:33.9646503Z self, 2025-05-07T20:32:33.9646705Z T: int, 2025-05-07T20:32:33.9646905Z D: int, 2025-05-07T20:32:33.9647127Z scale_ub: Optional[float], 2025-05-07T20:32:33.9647402Z contiguous: bool, 2025-05-07T20:32:33.9647644Z compiled: bool, 2025-05-07T20:32:33.9647874Z ) -> None: 2025-05-07T20:32:33.9648095Z torch.manual_seed(2025) 2025-05-07T20:32:33.9648339Z 2025-05-07T20:32:33.9648619Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.9649055Z 2025-05-07T20:32:33.9649252Z x_sign = torch.sign(x) 2025-05-07T20:32:33.9649549Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.9649865Z x = x_sign * x_clamp 2025-05-07T20:32:33.9650119Z x0 = x[:, :D] 2025-05-07T20:32:33.9650336Z x1 = x[:, D:] 2025-05-07T20:32:33.9650554Z 2025-05-07T20:32:33.9650747Z if contiguous: 2025-05-07T20:32:33.9650980Z x0 = x0.contiguous() 2025-05-07T20:32:33.9651253Z x1 = x1.contiguous() 2025-05-07T20:32:33.9651496Z 2025-05-07T20:32:33.9651775Z if scale_ub is not None: 2025-05-07T20:32:33.9652056Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.9652396Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.9652707Z ) 2025-05-07T20:32:33.9652908Z else: 2025-05-07T20:32:33.9653126Z scale_ub_tensor = None 2025-05-07T20:32:33.9653378Z 2025-05-07T20:32:33.9653617Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.9654096Z op = silu_mul_quant 2025-05-07T20:32:33.9654344Z if compiled: 2025-05-07T20:32:33.9654599Z op = torch.compile(op) 2025-05-07T20:32:33.9654901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.9655181Z 2025-05-07T20:32:33.9655460Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.9655637Z 2025-05-07T20:32:33.9655740Z moe/activation_test.py:117: 2025-05-07T20:32:33.9656044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.9656381Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.9656671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.9657371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.9658053Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.9658594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.9659277Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.9659940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.9660478Z kernel = self.compile( 2025-05-07T20:32:33.9661067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.9661728Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.9662140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.9662368Z 2025-05-07T20:32:33.9662576Z self = 2025-05-07T20:32:33.9663640Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.9665002Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dc2a020>} 2025-05-07T20:32:33.9666335Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.9667345Z context = 2025-05-07T20:32:33.9667635Z 2025-05-07T20:32:33.9667803Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.9668326Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.9668795Z module_map=module_map) 2025-05-07T20:32:33.9669158Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.9669564Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.9669830Z E ^ 2025-05-07T20:32:33.9670288Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.9670740Z 2025-05-07T20:32:33.9671155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.9671669Z 2025-05-07T20:32:33.9671774Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.9672234Z self=, 2025-05-07T20:32:33.9672636Z T=16384, 2025-05-07T20:32:33.9672839Z D=5120, 2025-05-07T20:32:33.9673043Z scale_ub=1200.0, 2025-05-07T20:32:33.9673269Z contiguous=False, 2025-05-07T20:32:33.9673500Z compiled=True, 2025-05-07T20:32:33.9673711Z ) 2025-05-07T20:32:33.9674030Z self = 2025-05-07T20:32:33.9674535Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:33.9674816Z 2025-05-07T20:32:33.9674900Z @given( 2025-05-07T20:32:33.9675186Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.9675543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.9675861Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.9676200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.9676527Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.9676821Z ) 2025-05-07T20:32:33.9677174Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.9677616Z def test_silu_mul_quant( 2025-05-07T20:32:33.9677866Z self, 2025-05-07T20:32:33.9678072Z T: int, 2025-05-07T20:32:33.9678279Z D: int, 2025-05-07T20:32:33.9678500Z scale_ub: Optional[float], 2025-05-07T20:32:33.9678784Z contiguous: bool, 2025-05-07T20:32:33.9679032Z compiled: bool, 2025-05-07T20:32:33.9679257Z ) -> None: 2025-05-07T20:32:33.9679479Z torch.manual_seed(2025) 2025-05-07T20:32:33.9679727Z 2025-05-07T20:32:33.9679999Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.9680350Z 2025-05-07T20:32:33.9680601Z x_sign = torch.sign(x) 2025-05-07T20:32:33.9680898Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.9681215Z x = x_sign * x_clamp 2025-05-07T20:32:33.9681465Z x0 = x[:, :D] 2025-05-07T20:32:33.9681684Z x1 = x[:, D:] 2025-05-07T20:32:33.9681901Z 2025-05-07T20:32:33.9682096Z if contiguous: 2025-05-07T20:32:33.9682330Z x0 = x0.contiguous() 2025-05-07T20:32:33.9682593Z x1 = x1.contiguous() 2025-05-07T20:32:33.9682840Z 2025-05-07T20:32:33.9683034Z if scale_ub is not None: 2025-05-07T20:32:33.9683315Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.9683656Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.9683971Z ) 2025-05-07T20:32:33.9684165Z else: 2025-05-07T20:32:33.9684381Z scale_ub_tensor = None 2025-05-07T20:32:33.9684637Z 2025-05-07T20:32:33.9684869Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.9685193Z op = silu_mul_quant 2025-05-07T20:32:33.9685453Z if compiled: 2025-05-07T20:32:33.9685703Z op = torch.compile(op) 2025-05-07T20:32:33.9686008Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.9686291Z 2025-05-07T20:32:33.9686487Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.9686658Z 2025-05-07T20:32:33.9686759Z moe/activation_test.py:117: 2025-05-07T20:32:33.9687059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.9687398Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.9687680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.9688293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.9688856Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.9689511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.9690195Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.9690734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.9691459Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.9692135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.9692672Z kernel = self.compile( 2025-05-07T20:32:33.9693218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.9693963Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.9694369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.9694596Z 2025-05-07T20:32:33.9694853Z self = 2025-05-07T20:32:33.9695931Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.9705265Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dc2b600>} 2025-05-07T20:32:33.9708133Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.9709145Z context = 2025-05-07T20:32:33.9709441Z 2025-05-07T20:32:33.9709610Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.9710252Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.9710725Z module_map=module_map) 2025-05-07T20:32:33.9711088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.9711450Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.9711717Z E ^ 2025-05-07T20:32:33.9712172Z E ValueError("type fp8e4nv not supported in this architecture. 
[Hypothesis went on to retry test_silu_mul_quant with the examples below; every one raised the identical CompilationError from the same _fbgemm_silu_mul_quant compile, so the repeated test bodies and tracebacks are omitted. The final example is reproduced in full.]
2025-05-07T20:32:33.9713661Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:34.1603769Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:34.1634740Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:34.3215747Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:34.3249442Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:34.4658250Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:34.4690392Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:34.4722648Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:34.6706218Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:34.6745801Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:34.8097686Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
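Rather than letting Hypothesis replay the identical compile failure for every drawn example, the whole test could be skipped up front on unsupported hardware. A sketch under the same assumptions as the check above (the SM 8.9 cutoff, plus a hypothetical test-class name; the `self=` reprs suggest the test lives in a `unittest.TestCase`):

```python
import unittest

import torch

def _fp8e4nv_supported() -> bool:
    # Assumed cutoff: FP8 E4M3 (fp8e4nv) lowers only on SM 8.9+.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical class name; the decorator skips every example up front
# instead of failing once per Hypothesis draw.
@unittest.skipIf(not _fp8e4nv_supported(), "fp8e4nv needs compute capability >= 8.9")
class SiluMulQuantTests(unittest.TestCase):
    ...
```

The skip condition is evaluated once, when the module is imported, so the class-level guard covers every parameter combination Hypothesis would otherwise try.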
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.8128040Z 2025-05-07T20:32:34.8128456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.8128959Z 2025-05-07T20:32:34.8129069Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.8129473Z self=, 2025-05-07T20:32:34.8129875Z T=2048, 2025-05-07T20:32:34.8130069Z D=7168, 2025-05-07T20:32:34.8130265Z scale_ub=None, 2025-05-07T20:32:34.8130486Z contiguous=False, 2025-05-07T20:32:34.8130713Z compiled=True, 2025-05-07T20:32:34.8130912Z ) 2025-05-07T20:32:34.8131228Z self = 2025-05-07T20:32:34.8131723Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:34.8131990Z 2025-05-07T20:32:34.8132130Z @given( 2025-05-07T20:32:34.8132366Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.8132690Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.8133001Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.8133332Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.8133737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.8134021Z ) 2025-05-07T20:32:34.8134369Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.8134812Z def test_silu_mul_quant( 2025-05-07T20:32:34.8135101Z self, 2025-05-07T20:32:34.8135308Z T: int, 2025-05-07T20:32:34.8135505Z D: int, 2025-05-07T20:32:34.8135725Z scale_ub: Optional[float], 2025-05-07T20:32:34.8135998Z contiguous: bool, 2025-05-07T20:32:34.8136246Z compiled: bool, 2025-05-07T20:32:34.8136470Z ) -> None: 2025-05-07T20:32:34.8136692Z torch.manual_seed(2025) 2025-05-07T20:32:34.8136938Z 2025-05-07T20:32:34.8137202Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.8137552Z 2025-05-07T20:32:34.8137748Z x_sign = torch.sign(x) 2025-05-07T20:32:34.8138029Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.8138338Z x = x_sign * x_clamp 2025-05-07T20:32:34.8138577Z x0 = x[:, :D] 2025-05-07T20:32:34.8138798Z x1 = x[:, D:] 2025-05-07T20:32:34.8139002Z 2025-05-07T20:32:34.8139192Z if contiguous: 2025-05-07T20:32:34.8139423Z x0 = x0.contiguous() 2025-05-07T20:32:34.8139729Z x1 = x1.contiguous() 2025-05-07T20:32:34.8139971Z 2025-05-07T20:32:34.8140165Z if scale_ub is not None: 2025-05-07T20:32:34.8140434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.8140769Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.8141080Z ) 2025-05-07T20:32:34.8141273Z else: 2025-05-07T20:32:34.8141487Z scale_ub_tensor = None 2025-05-07T20:32:34.8141743Z 2025-05-07T20:32:34.8141970Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.8142335Z op = silu_mul_quant 2025-05-07T20:32:34.8142584Z if compiled: 2025-05-07T20:32:34.8142823Z op = torch.compile(op) 2025-05-07T20:32:34.8143118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.8143393Z 2025-05-07T20:32:34.8143582Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.8143755Z 2025-05-07T20:32:34.8143853Z moe/activation_test.py:117: 2025-05-07T20:32:34.8144151Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.8144480Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.8144782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.8145402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.8145959Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.8146605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.8147293Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.8147834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.8148509Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.8149156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.8149695Z kernel = self.compile( 2025-05-07T20:32:34.8150228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.8150882Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.8151323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.8151558Z 2025-05-07T20:32:34.8151764Z self = 2025-05-07T20:32:34.8152831Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.8154174Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78be12c2c0>} 2025-05-07T20:32:34.8155489Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.8156502Z context = 2025-05-07T20:32:34.8156792Z 2025-05-07T20:32:34.8156958Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.8157480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.8157939Z module_map=module_map) 2025-05-07T20:32:34.8158303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.8158657Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.8158914Z E ^ 2025-05-07T20:32:34.8159370Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.8159863Z 2025-05-07T20:32:34.8160272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.8160775Z 2025-05-07T20:32:34.8160888Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.8161292Z self=, 2025-05-07T20:32:34.8161689Z T=4096, 2025-05-07T20:32:34.8161876Z D=7168, 2025-05-07T20:32:34.8162065Z scale_ub=None, 2025-05-07T20:32:34.8162325Z contiguous=False, 2025-05-07T20:32:34.8162548Z compiled=True, 2025-05-07T20:32:35.2282507Z ) 2025-05-07T20:32:35.2282863Z self = 2025-05-07T20:32:35.2283370Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.2283675Z 2025-05-07T20:32:35.2283771Z @given( 2025-05-07T20:32:35.2284007Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2284327Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2284637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2284987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2285457Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2285750Z ) 2025-05-07T20:32:35.2286095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2286528Z def test_silu_mul_quant( 2025-05-07T20:32:35.2286773Z self, 2025-05-07T20:32:35.2286966Z T: int, 2025-05-07T20:32:35.2287322Z D: int, 2025-05-07T20:32:35.2287544Z scale_ub: Optional[float], 2025-05-07T20:32:35.2287817Z contiguous: bool, 2025-05-07T20:32:35.2288048Z compiled: bool, 2025-05-07T20:32:35.2288271Z ) -> None: 2025-05-07T20:32:35.2288483Z torch.manual_seed(2025) 2025-05-07T20:32:35.2288720Z 2025-05-07T20:32:35.2288992Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2289325Z 2025-05-07T20:32:35.2289519Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2289800Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2290105Z x = x_sign * x_clamp 2025-05-07T20:32:35.2290344Z x0 = x[:, :D] 2025-05-07T20:32:35.2290650Z x1 = x[:, D:] 2025-05-07T20:32:35.2290858Z 2025-05-07T20:32:35.2291041Z if contiguous: 2025-05-07T20:32:35.2291263Z x0 = x0.contiguous() 2025-05-07T20:32:35.2291520Z x1 = x1.contiguous() 2025-05-07T20:32:35.2291759Z 2025-05-07T20:32:35.2291943Z if scale_ub is not None: 2025-05-07T20:32:35.2292216Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2292549Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2292856Z ) 2025-05-07T20:32:35.2293050Z else: 2025-05-07T20:32:35.2293263Z scale_ub_tensor = None 2025-05-07T20:32:35.2293509Z 2025-05-07T20:32:35.2293818Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2294131Z op = silu_mul_quant 2025-05-07T20:32:35.2294377Z if compiled: 2025-05-07T20:32:35.2294616Z op = torch.compile(op) 2025-05-07T20:32:35.2294914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2295194Z 2025-05-07T20:32:35.2295380Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.2295545Z 2025-05-07T20:32:35.2295644Z moe/activation_test.py:117: 2025-05-07T20:32:35.2295938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2296263Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.2296551Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2297108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.2297670Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.2298672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.2299359Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.2299896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2300563Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2301221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2301834Z kernel = self.compile( 2025-05-07T20:32:35.2302373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2303013Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2303410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2303637Z 2025-05-07T20:32:35.2303849Z self = 2025-05-07T20:32:35.2304973Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2306313Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78be12cd60>} 2025-05-07T20:32:35.2307632Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2308705Z context = 2025-05-07T20:32:35.2309024Z 2025-05-07T20:32:35.2309198Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2309706Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2310166Z module_map=module_map) 2025-05-07T20:32:35.2310529Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2310950Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.2311204Z E ^ 2025-05-07T20:32:35.2311658Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2312099Z 2025-05-07T20:32:35.2312512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.2313012Z 2025-05-07T20:32:35.2313117Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2313517Z self=, 2025-05-07T20:32:35.2313913Z T=16384, 2025-05-07T20:32:35.2314108Z D=5120, 2025-05-07T20:32:35.2314336Z scale_ub=1200.0, 2025-05-07T20:32:35.2314550Z contiguous=False, 2025-05-07T20:32:35.2314774Z compiled=False, 2025-05-07T20:32:35.2314978Z ) 2025-05-07T20:32:35.2315295Z self = 2025-05-07T20:32:35.2315784Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.2316064Z 2025-05-07T20:32:35.2316141Z @given( 2025-05-07T20:32:35.2316373Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2316680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2316982Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2317309Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2317625Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2317907Z ) 2025-05-07T20:32:35.2318258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2318763Z def test_silu_mul_quant( 2025-05-07T20:32:35.2318992Z self, 2025-05-07T20:32:35.2319184Z T: int, 2025-05-07T20:32:35.2319379Z D: int, 2025-05-07T20:32:35.2319589Z scale_ub: Optional[float], 2025-05-07T20:32:35.2319865Z contiguous: bool, 2025-05-07T20:32:35.2320106Z compiled: bool, 2025-05-07T20:32:35.2320323Z ) -> None: 2025-05-07T20:32:35.2320537Z torch.manual_seed(2025) 2025-05-07T20:32:35.2320781Z 2025-05-07T20:32:35.2321090Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2321430Z 2025-05-07T20:32:35.2321622Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2329399Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2329754Z x = x_sign * x_clamp 2025-05-07T20:32:35.2329999Z x0 = x[:, :D] 2025-05-07T20:32:35.2330226Z x1 = x[:, D:] 2025-05-07T20:32:35.2330429Z 2025-05-07T20:32:35.2330623Z if contiguous: 2025-05-07T20:32:35.2330857Z x0 = x0.contiguous() 2025-05-07T20:32:35.2331111Z x1 = x1.contiguous() 2025-05-07T20:32:35.2331353Z 2025-05-07T20:32:35.2331551Z if scale_ub is not None: 2025-05-07T20:32:35.2331820Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2332235Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2332548Z ) 2025-05-07T20:32:35.2332744Z else: 2025-05-07T20:32:35.2332961Z scale_ub_tensor = None 2025-05-07T20:32:35.2333218Z 2025-05-07T20:32:35.2333458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2333845Z op = silu_mul_quant 2025-05-07T20:32:35.2334100Z if compiled: 2025-05-07T20:32:35.2334353Z op = torch.compile(op) 2025-05-07T20:32:35.2334645Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2334943Z 2025-05-07T20:32:35.2335174Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.2335345Z 2025-05-07T20:32:35.2335445Z moe/activation_test.py:117: 2025-05-07T20:32:35.2335746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2336086Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.2336365Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2337108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:35.2337801Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.2338336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2339003Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2339664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2340205Z kernel = self.compile( 2025-05-07T20:32:35.2340749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2341394Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2341799Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2342025Z 2025-05-07T20:32:35.2342243Z self = 2025-05-07T20:32:35.2343305Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2344661Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78be12dc60>} 2025-05-07T20:32:35.2345987Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2347044Z context = 2025-05-07T20:32:35.2347322Z 2025-05-07T20:32:35.2347490Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2348010Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2348521Z module_map=module_map) 2025-05-07T20:32:35.2348887Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2349237Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.2349502Z E ^ 2025-05-07T20:32:35.2349961Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2350401Z 2025-05-07T20:32:35.2350815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.2351323Z 2025-05-07T20:32:35.2351425Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2351833Z self=, 2025-05-07T20:32:35.2352276Z T=16384, 2025-05-07T20:32:35.2352468Z D=5120, 2025-05-07T20:32:35.2352671Z scale_ub=1200.0, 2025-05-07T20:32:35.2352896Z contiguous=True, 2025-05-07T20:32:35.2353113Z compiled=True, 2025-05-07T20:32:35.2353321Z ) 2025-05-07T20:32:35.2353642Z self = 2025-05-07T20:32:35.2354130Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.2354411Z 2025-05-07T20:32:35.2354488Z @given( 2025-05-07T20:32:35.2354722Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2355037Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2355342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2355670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2355997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2356276Z ) 2025-05-07T20:32:35.2356635Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2357124Z def test_silu_mul_quant( 2025-05-07T20:32:35.2357361Z self, 2025-05-07T20:32:35.2357560Z T: int, 2025-05-07T20:32:35.2357761Z D: int, 2025-05-07T20:32:35.2357977Z scale_ub: Optional[float], 2025-05-07T20:32:35.2358249Z contiguous: bool, 2025-05-07T20:32:35.2358491Z compiled: bool, 2025-05-07T20:32:35.2358716Z ) -> None: 2025-05-07T20:32:35.2358929Z torch.manual_seed(2025) 2025-05-07T20:32:35.2359170Z 2025-05-07T20:32:35.2359441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2359777Z 2025-05-07T20:32:35.2359978Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2360271Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2360573Z x = x_sign * x_clamp 2025-05-07T20:32:35.2360813Z x0 = x[:, :D] 2025-05-07T20:32:35.2361033Z x1 = x[:, D:] 2025-05-07T20:32:35.2361240Z 2025-05-07T20:32:35.2361432Z if contiguous: 2025-05-07T20:32:35.2361670Z x0 = x0.contiguous() 2025-05-07T20:32:35.2361924Z x1 = x1.contiguous() 2025-05-07T20:32:35.2362169Z 2025-05-07T20:32:35.2362367Z if scale_ub is not None: 2025-05-07T20:32:35.2362638Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2362975Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2363284Z ) 2025-05-07T20:32:35.2363473Z else: 2025-05-07T20:32:35.2363685Z scale_ub_tensor = None 2025-05-07T20:32:35.2363939Z 2025-05-07T20:32:35.2364165Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2364522Z op = silu_mul_quant 2025-05-07T20:32:35.2364769Z if compiled: 2025-05-07T20:32:35.2365022Z op = torch.compile(op) 2025-05-07T20:32:35.2365310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2365590Z 2025-05-07T20:32:35.2365787Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.2365949Z 2025-05-07T20:32:35.2366051Z moe/activation_test.py:117: 2025-05-07T20:32:35.2366344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2366722Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.2366995Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2367547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.2368098Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.2368751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.2369425Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.2369956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2370701Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2371361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2371882Z kernel = self.compile( 2025-05-07T20:32:35.2372422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2373069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2373459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2373767Z 2025-05-07T20:32:35.2373973Z self = 2025-05-07T20:32:35.2375086Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2376477Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78be12f380>} 2025-05-07T20:32:35.2377794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2378796Z context = 2025-05-07T20:32:35.2379085Z 2025-05-07T20:32:35.2379248Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2379759Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2380221Z module_map=module_map) 2025-05-07T20:32:35.2380577Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2380929Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.2381192Z E ^ 2025-05-07T20:32:35.2381641Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2382086Z 2025-05-07T20:32:35.2382493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3925474Z 2025-05-07T20:32:35.3925637Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3926237Z self=, 2025-05-07T20:32:35.3926690Z T=16384, 2025-05-07T20:32:35.3926900Z D=5120, 2025-05-07T20:32:35.3927096Z scale_ub=None, 2025-05-07T20:32:35.3927319Z contiguous=False, 2025-05-07T20:32:35.3927681Z compiled=True, 2025-05-07T20:32:35.3927885Z ) 2025-05-07T20:32:35.3928213Z self = 2025-05-07T20:32:35.3928710Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.3928985Z 2025-05-07T20:32:35.3929067Z @given( 2025-05-07T20:32:35.3929308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3929626Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3930002Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3930325Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3930656Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3930951Z ) 2025-05-07T20:32:35.3931293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3931735Z def test_silu_mul_quant( 2025-05-07T20:32:35.3931981Z self, 2025-05-07T20:32:35.3932180Z T: int, 2025-05-07T20:32:35.3932384Z D: int, 2025-05-07T20:32:35.3932604Z scale_ub: Optional[float], 2025-05-07T20:32:35.3932874Z contiguous: bool, 2025-05-07T20:32:35.3933124Z compiled: bool, 2025-05-07T20:32:35.3933355Z ) -> None: 2025-05-07T20:32:35.3933637Z torch.manual_seed(2025) 2025-05-07T20:32:35.3933960Z 2025-05-07T20:32:35.3934236Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3934584Z 2025-05-07T20:32:35.3934779Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3935102Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3935441Z x = x_sign * x_clamp 2025-05-07T20:32:35.3935678Z x0 = x[:, :D] 2025-05-07T20:32:35.3935901Z x1 = x[:, D:] 2025-05-07T20:32:35.3936112Z 2025-05-07T20:32:35.3936296Z if contiguous: 2025-05-07T20:32:35.3936532Z x0 = x0.contiguous() 2025-05-07T20:32:35.3936794Z x1 = x1.contiguous() 2025-05-07T20:32:35.3937033Z 2025-05-07T20:32:35.3937230Z if scale_ub is not None: 2025-05-07T20:32:35.3937506Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3937836Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3938154Z ) 2025-05-07T20:32:35.3938352Z else: 2025-05-07T20:32:35.3938633Z scale_ub_tensor = None 2025-05-07T20:32:35.3938892Z 2025-05-07T20:32:35.3939127Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3939445Z op = silu_mul_quant 2025-05-07T20:32:35.3939714Z if compiled: 2025-05-07T20:32:35.3939961Z op = torch.compile(op) 2025-05-07T20:32:35.3940261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3940542Z 2025-05-07T20:32:35.3940742Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3940912Z 2025-05-07T20:32:35.3941013Z moe/activation_test.py:117: 2025-05-07T20:32:35.3941315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3941654Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3941932Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3942491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3943055Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3943712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3944396Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3944933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3945605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3946260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3946845Z kernel = self.compile( 2025-05-07T20:32:35.3947381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3948036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3948436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3948668Z 2025-05-07T20:32:35.3948875Z self = 2025-05-07T20:32:35.3949982Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3951333Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d8a85e0>} 2025-05-07T20:32:35.3952653Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3953705Z context = 2025-05-07T20:32:35.3953996Z 2025-05-07T20:32:35.3954164Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3954684Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3955151Z module_map=module_map) 2025-05-07T20:32:35.3955517Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3955873Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3956132Z E ^ 2025-05-07T20:32:35.3956591Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3957040Z 2025-05-07T20:32:35.3957451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3957954Z 2025-05-07T20:32:35.3958063Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3958474Z self=, 2025-05-07T20:32:35.3958917Z T=2048, 2025-05-07T20:32:35.3959113Z D=5120, 2025-05-07T20:32:35.3959305Z scale_ub=None, 2025-05-07T20:32:35.3959521Z contiguous=False, 2025-05-07T20:32:35.3959750Z compiled=True, 2025-05-07T20:32:35.3959949Z ) 2025-05-07T20:32:35.3960271Z self = 2025-05-07T20:32:35.3960763Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.3961033Z 2025-05-07T20:32:35.3961116Z @given( 2025-05-07T20:32:35.3961346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3961666Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3961973Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3962298Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3962627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3962919Z ) 2025-05-07T20:32:35.3963263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3963708Z def test_silu_mul_quant( 2025-05-07T20:32:35.3963950Z self, 2025-05-07T20:32:35.3964148Z T: int, 2025-05-07T20:32:35.3964341Z D: int, 2025-05-07T20:32:35.3964561Z scale_ub: Optional[float], 2025-05-07T20:32:35.3964837Z contiguous: bool, 2025-05-07T20:32:35.3965102Z compiled: bool, 2025-05-07T20:32:35.3965350Z ) -> None: 2025-05-07T20:32:35.3965567Z torch.manual_seed(2025) 2025-05-07T20:32:35.3965805Z 2025-05-07T20:32:35.3966072Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3966461Z 2025-05-07T20:32:35.3966653Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3966943Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3967252Z x = x_sign * x_clamp 2025-05-07T20:32:35.3967484Z x0 = x[:, :D] 2025-05-07T20:32:35.3967702Z x1 = x[:, D:] 2025-05-07T20:32:35.3967908Z 2025-05-07T20:32:35.3968094Z if contiguous: 2025-05-07T20:32:35.3968327Z x0 = x0.contiguous() 2025-05-07T20:32:35.3968583Z x1 = x1.contiguous() 2025-05-07T20:32:35.3968869Z 2025-05-07T20:32:35.3969062Z if scale_ub is not None: 2025-05-07T20:32:35.3969336Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3969670Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3969974Z ) 2025-05-07T20:32:35.3970166Z else: 2025-05-07T20:32:35.3970373Z scale_ub_tensor = None 2025-05-07T20:32:35.3970621Z 2025-05-07T20:32:35.3970855Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3971174Z op = silu_mul_quant 2025-05-07T20:32:35.3971419Z if compiled: 2025-05-07T20:32:35.3971665Z op = torch.compile(op) 2025-05-07T20:32:35.3972006Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3972280Z 2025-05-07T20:32:35.3972476Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3972638Z 2025-05-07T20:32:35.3972739Z moe/activation_test.py:117: 2025-05-07T20:32:35.3973029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3973414Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3973867Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3974430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3974978Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3975621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3976304Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3976830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3977566Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3978224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3978753Z kernel = self.compile( 2025-05-07T20:32:35.3979275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3979923Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3980318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3980539Z 2025-05-07T20:32:35.3980748Z self = 2025-05-07T20:32:35.3981801Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3983149Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d8a9440>} 2025-05-07T20:32:35.3984460Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3985457Z context = 2025-05-07T20:32:35.3985739Z 2025-05-07T20:32:35.3985909Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3986462Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3986918Z module_map=module_map) 2025-05-07T20:32:35.3987279Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3987622Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3987876Z E ^ 2025-05-07T20:32:35.3988329Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3988838Z 2025-05-07T20:32:35.3989249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5584829Z 2025-05-07T20:32:35.5585248Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5585672Z self=, 2025-05-07T20:32:35.5586126Z T=2048, 2025-05-07T20:32:35.5586318Z D=5120, 2025-05-07T20:32:35.5586523Z scale_ub=1200.0, 2025-05-07T20:32:35.5586748Z contiguous=False, 2025-05-07T20:32:35.5586975Z compiled=True, 2025-05-07T20:32:35.5587182Z ) 2025-05-07T20:32:35.5587501Z self = 2025-05-07T20:32:35.5588110Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.5588390Z 2025-05-07T20:32:35.5588471Z @given( 2025-05-07T20:32:35.5588702Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5589015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5589322Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5589653Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5589975Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5590264Z ) 2025-05-07T20:32:35.5590611Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5591051Z def test_silu_mul_quant( 2025-05-07T20:32:35.5591296Z self, 2025-05-07T20:32:35.5591492Z T: int, 2025-05-07T20:32:35.5591685Z D: int, 2025-05-07T20:32:35.5591901Z scale_ub: Optional[float], 2025-05-07T20:32:35.5592170Z contiguous: bool, 2025-05-07T20:32:35.5592414Z compiled: bool, 2025-05-07T20:32:35.5592634Z ) -> None: 2025-05-07T20:32:35.5592924Z torch.manual_seed(2025) 2025-05-07T20:32:35.5593165Z 2025-05-07T20:32:35.5593431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5593773Z 2025-05-07T20:32:35.5593966Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5594254Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5594569Z x = x_sign * x_clamp 2025-05-07T20:32:35.5594809Z x0 = x[:, :D] 2025-05-07T20:32:35.5595020Z x1 = x[:, D:] 2025-05-07T20:32:35.5595229Z 2025-05-07T20:32:35.5595417Z if contiguous: 2025-05-07T20:32:35.5595643Z x0 = x0.contiguous() 2025-05-07T20:32:35.5595906Z x1 = x1.contiguous() 2025-05-07T20:32:35.5596148Z 2025-05-07T20:32:35.5596337Z if scale_ub is not None: 2025-05-07T20:32:35.5596609Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5596945Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5597258Z ) 2025-05-07T20:32:35.5597451Z else: 2025-05-07T20:32:35.5597666Z scale_ub_tensor = None 2025-05-07T20:32:35.5597920Z 2025-05-07T20:32:35.5598147Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5598621Z op = silu_mul_quant 2025-05-07T20:32:35.5598874Z if compiled: 2025-05-07T20:32:35.5599115Z op = torch.compile(op) 2025-05-07T20:32:35.5599411Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5599682Z 2025-05-07T20:32:35.5599871Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5600040Z 2025-05-07T20:32:35.5600213Z moe/activation_test.py:117: 2025-05-07T20:32:35.5600508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5600832Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5601111Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5601674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5602226Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5602871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5603619Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5604151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5604814Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5605470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5605996Z kernel = self.compile( 2025-05-07T20:32:35.5606522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5607237Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5613386Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5613643Z 2025-05-07T20:32:35.5613924Z self = 2025-05-07T20:32:35.5614993Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5616341Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d8aa660>} 2025-05-07T20:32:35.5617668Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5618764Z context = 2025-05-07T20:32:35.5619059Z 2025-05-07T20:32:35.5619226Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5619743Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5620211Z module_map=module_map) 2025-05-07T20:32:35.5620568Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5620922Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5621180Z E ^ 2025-05-07T20:32:35.5621630Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5622084Z 2025-05-07T20:32:35.5622491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5623002Z 2025-05-07T20:32:35.5623110Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5623514Z self=, 2025-05-07T20:32:35.5623905Z T=4096, 2025-05-07T20:32:35.5624093Z D=5120, 2025-05-07T20:32:35.5624284Z scale_ub=1200.0, 2025-05-07T20:32:35.5624500Z contiguous=True, 2025-05-07T20:32:35.5624718Z compiled=True, 2025-05-07T20:32:35.5624920Z ) 2025-05-07T20:32:35.5625226Z self = 2025-05-07T20:32:35.5625712Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.5625985Z 2025-05-07T20:32:35.5626059Z @given( 2025-05-07T20:32:35.5626285Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5626635Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5626933Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5627259Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5627576Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5627858Z ) 2025-05-07T20:32:35.5628201Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5628627Z def test_silu_mul_quant( 2025-05-07T20:32:35.5628906Z self, 2025-05-07T20:32:35.5629095Z T: int, 2025-05-07T20:32:35.5629278Z D: int, 2025-05-07T20:32:35.5629488Z scale_ub: Optional[float], 2025-05-07T20:32:35.5629753Z contiguous: bool, 2025-05-07T20:32:35.5629984Z compiled: bool, 2025-05-07T20:32:35.5630195Z ) -> None: 2025-05-07T20:32:35.5630399Z torch.manual_seed(2025) 2025-05-07T20:32:35.5630630Z 2025-05-07T20:32:35.5630894Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5631222Z 2025-05-07T20:32:35.5631404Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5631689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5632031Z x = x_sign * x_clamp 2025-05-07T20:32:35.5632266Z x0 = x[:, :D] 2025-05-07T20:32:35.5632476Z x1 = x[:, D:] 2025-05-07T20:32:35.5632675Z 2025-05-07T20:32:35.5632852Z if contiguous: 2025-05-07T20:32:35.5633075Z x0 = x0.contiguous() 2025-05-07T20:32:35.5633320Z x1 = x1.contiguous() 2025-05-07T20:32:35.5633552Z 2025-05-07T20:32:35.5633730Z if scale_ub is not None: 2025-05-07T20:32:35.5633988Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5634309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5634607Z ) 2025-05-07T20:32:35.5634790Z else: 2025-05-07T20:32:35.5634987Z scale_ub_tensor = None 2025-05-07T20:32:35.5635235Z 2025-05-07T20:32:35.5635453Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5635754Z op = silu_mul_quant 2025-05-07T20:32:35.5635992Z if compiled: 2025-05-07T20:32:35.5636230Z op = torch.compile(op) 2025-05-07T20:32:35.5636564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5636832Z 2025-05-07T20:32:35.5637017Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5637175Z 2025-05-07T20:32:35.5637271Z moe/activation_test.py:117: 2025-05-07T20:32:35.5637556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5637875Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5638145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5638682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5639227Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5639869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5640531Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5641055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5641720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5642364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5642880Z kernel = self.compile( 2025-05-07T20:32:35.5643406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5644045Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5644430Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5644705Z 2025-05-07T20:32:35.5644905Z self = 2025-05-07T20:32:35.5646015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5647349Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d8ab9c0>} 2025-05-07T20:32:35.5648702Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5649688Z context = 2025-05-07T20:32:35.5649970Z 2025-05-07T20:32:35.5650130Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5650637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5651091Z module_map=module_map) 2025-05-07T20:32:35.5651482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5651829Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5652080Z E ^ 2025-05-07T20:32:35.5652523Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5652964Z 2025-05-07T20:32:35.5653367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7332140Z 2025-05-07T20:32:35.7332925Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7333568Z self=, 2025-05-07T20:32:35.7334284Z T=128, 2025-05-07T20:32:35.7334505Z D=5120, 2025-05-07T20:32:35.7334716Z scale_ub=1200.0, 2025-05-07T20:32:35.7334939Z contiguous=False, 2025-05-07T20:32:35.7335203Z compiled=True, 2025-05-07T20:32:35.7335439Z ) 2025-05-07T20:32:35.7335765Z self = 2025-05-07T20:32:35.7336487Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.7336777Z 2025-05-07T20:32:35.7336862Z @given( 2025-05-07T20:32:35.7337100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7337416Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7337726Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7338055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7338376Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7338671Z ) 2025-05-07T20:32:35.7339023Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7339463Z def test_silu_mul_quant( 2025-05-07T20:32:35.7339717Z self, 2025-05-07T20:32:35.7339923Z T: int, 2025-05-07T20:32:35.7340117Z D: int, 2025-05-07T20:32:35.7340340Z scale_ub: Optional[float], 2025-05-07T20:32:35.7340618Z contiguous: bool, 2025-05-07T20:32:35.7340867Z compiled: bool, 2025-05-07T20:32:35.7341095Z ) -> None: 2025-05-07T20:32:35.7341323Z torch.manual_seed(2025) 2025-05-07T20:32:35.7341573Z 2025-05-07T20:32:35.7341843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7342194Z 2025-05-07T20:32:35.7342397Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7342681Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7342998Z x = x_sign * x_clamp 2025-05-07T20:32:35.7343244Z x0 = x[:, :D] 2025-05-07T20:32:35.7343462Z x1 = x[:, D:] 2025-05-07T20:32:35.7343685Z 2025-05-07T20:32:35.7343982Z if contiguous: 2025-05-07T20:32:35.7344224Z x0 = x0.contiguous() 2025-05-07T20:32:35.7344495Z x1 = x1.contiguous() 2025-05-07T20:32:35.7344746Z 2025-05-07T20:32:35.7344941Z if scale_ub is not None: 2025-05-07T20:32:35.7345227Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7345571Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7345887Z ) 2025-05-07T20:32:35.7346085Z else: 2025-05-07T20:32:35.7346308Z scale_ub_tensor = None 2025-05-07T20:32:35.7346695Z 2025-05-07T20:32:35.7346924Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7347251Z op = silu_mul_quant 2025-05-07T20:32:35.7347506Z if compiled: 2025-05-07T20:32:35.7347791Z op = torch.compile(op) 2025-05-07T20:32:35.7348088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7348370Z 2025-05-07T20:32:35.7348569Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7348733Z 2025-05-07T20:32:35.7348832Z moe/activation_test.py:117: 2025-05-07T20:32:35.7349133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7349468Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7349833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7350405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7350966Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7351626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7352303Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7352841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7353515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7354181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7354708Z kernel = self.compile( 2025-05-07T20:32:35.7355297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7356004Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7356401Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7356639Z 2025-05-07T20:32:35.7356846Z self = 2025-05-07T20:32:35.7357906Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7359278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d108fe0>} 2025-05-07T20:32:35.7360608Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7361610Z context = 2025-05-07T20:32:35.7361907Z 2025-05-07T20:32:35.7362073Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7362595Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7363067Z module_map=module_map) 2025-05-07T20:32:35.7363429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7363785Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7364053Z E ^ 2025-05-07T20:32:35.7364555Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7365005Z 2025-05-07T20:32:35.7365414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7365930Z 2025-05-07T20:32:35.7366036Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7366454Z self=, 2025-05-07T20:32:35.7366869Z T=16384, 2025-05-07T20:32:35.7367111Z D=7168, 2025-05-07T20:32:35.7367302Z scale_ub=1200.0, 2025-05-07T20:32:35.7367531Z contiguous=True, 2025-05-07T20:32:35.7367770Z compiled=True, 2025-05-07T20:32:35.7367971Z ) 2025-05-07T20:32:35.7368290Z self = 2025-05-07T20:32:35.7368785Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.7369057Z 2025-05-07T20:32:35.7369137Z @given( 2025-05-07T20:32:35.7369373Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7369688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7369989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7370365Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7370700Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7370994Z ) 2025-05-07T20:32:35.7371334Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7371786Z def test_silu_mul_quant( 2025-05-07T20:32:35.7372029Z self, 2025-05-07T20:32:35.7372219Z T: int, 2025-05-07T20:32:35.7372417Z D: int, 2025-05-07T20:32:35.7372636Z scale_ub: Optional[float], 2025-05-07T20:32:35.7372901Z contiguous: bool, 2025-05-07T20:32:35.7373140Z compiled: bool, 2025-05-07T20:32:35.7373366Z ) -> None: 2025-05-07T20:32:35.7373576Z torch.manual_seed(2025) 2025-05-07T20:32:35.7373922Z 2025-05-07T20:32:35.7374195Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7374531Z 2025-05-07T20:32:35.7374728Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7375026Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7375332Z x = x_sign * x_clamp 2025-05-07T20:32:35.7375622Z x0 = x[:, :D] 2025-05-07T20:32:35.7375871Z x1 = x[:, D:] 2025-05-07T20:32:35.7376106Z 2025-05-07T20:32:35.7376288Z if contiguous: 2025-05-07T20:32:35.7376525Z x0 = x0.contiguous() 2025-05-07T20:32:35.7376791Z x1 = x1.contiguous() 2025-05-07T20:32:35.7377029Z 2025-05-07T20:32:35.7377226Z if scale_ub is not None: 2025-05-07T20:32:35.7377506Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7377835Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7378150Z ) 2025-05-07T20:32:35.7378353Z else: 2025-05-07T20:32:35.7378562Z scale_ub_tensor = None 2025-05-07T20:32:35.7378819Z 2025-05-07T20:32:35.7379057Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7379367Z op = silu_mul_quant 2025-05-07T20:32:35.7379629Z if compiled: 2025-05-07T20:32:35.7379884Z op = torch.compile(op) 2025-05-07T20:32:35.7380179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7380458Z 2025-05-07T20:32:35.7380655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7380823Z 2025-05-07T20:32:35.7380930Z moe/activation_test.py:117: 2025-05-07T20:32:35.7381221Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7381552Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7381838Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7382388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7382996Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7383649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:35.7384328Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.7384860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.7385538Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.7386248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.7386771Z     kernel = self.compile(
2025-05-07T20:32:35.7387315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.7387969Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.7388371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.7389911Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.7394034Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.7394548Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:32:35.7395495Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.7395850Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.7396113Z E   ^
2025-05-07T20:32:35.7396570Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.7397428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:35.8555646Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[Triton compile stack identical to the traceback above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
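Note: fp8e4nv is Triton's name for the float8 E4M3 format that the FP8 quantization in _fbgemm_silu_mul_quant produces. Triton only lowers this type on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); older architectures expose only fp8e4b15 and fp8e5, so every sampled parameter combination fails identically at kernel-compile time. A minimal guard along these lines (hypothetical, not part of the FBGEMM test suite; the (8, 9) cutoff assumes current Triton behavior) would skip the test on such GPUs instead:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # fp8e4nv (float8 E4M3, torch.float8_e4m3fn) is only lowered by Triton
        # on devices with compute capability >= 8.9 (Ada / Hopper).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch, applied to the failing property test:
    #
    #   @unittest.skipUnless(gpu_supports_fp8e4nv(), "FP8 E4M3 unsupported on this GPU")
    #   def test_silu_mul_quant(self, ...) -> None:
    #       ...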
2025-05-07T20:32:35.8592426Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") [test body and traceback identical to the above]
2025-05-07T20:32:35.8632097Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching silu_mul_quant
2025-05-07T20:32:36.0264183Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:36.0297280Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
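Note: the failure is independent of the sampled (T, D, scale_ub, contiguous, compiled) values; any Triton kernel that materializes the fp8e4nv type trips the same architecture check during make_ir. A standalone sketch (hypothetical, not taken from the test suite) that should reproduce the identical ValueError on a pre-8.9 GPU:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        # On compute capability < 8.9 this cast is what raises
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        y = x.to(tl.float8e4nv)
        tl.store(y_ptr + offs, y, mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    cast_to_fp8[(1,)](x, y, 1024, BLOCK=1024)  # kernel compiles (and fails) here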
2025-05-07T20:32:36.1587441Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

[test body as above, failing earlier, at the first large temporary:]

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
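Note: the OutOfMemoryError examples look like a secondary effect: by this point roughly 21.6 GiB of the 22.07 GiB device is already allocated, each example creates a fresh [T, 2*D] bfloat16 tensor plus several same-sized temporaries (sign, abs, clamp, product), and memory accumulates across Hypothesis examples until even small allocations fail. A sketch (illustrative; free_cuda_memory is not an existing helper) of reclaiming memory between examples, alongside the allocator's own expandable_segments suggestion:

    import gc

    import torch

    def free_cuda_memory() -> None:
        gc.collect()              # drop Python references to dead tensors first
        torch.cuda.empty_cache()  # release cached, unused blocks back to the driver
        torch.cuda.synchronize()  # make sure pending frees have completed

    # e.g. from the test class, so each Hypothesis example starts from a clean slate:
    #
    #   def tearDown(self) -> None:
    #       free_cuda_memory()
    #
    # and/or, before launching the test process:
    #
    #   export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True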
2025-05-07T20:32:36.1601188Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 112.00 MiB)
2025-05-07T20:32:36.1614568Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 448.00 MiB)
2025-05-07T20:32:36.2857767Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 56.00 MiB)
2025-05-07T20:32:36.2871948Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign; tried to allocate 56.00 MiB)
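Note: the reported allocation sizes line up exactly with the tensor shapes in the test: a [T, 2*D] bfloat16 tensor costs T * 2D * 2 bytes, and sign/abs/clamp each allocate another tensor of the same size. A quick check of the figures above:

    >>> 16384 * (2 * 7168) * 2 / 2**20   # x = torch.randn([T, 2 * D]), T=16384, D=7168
    448.0
    >>> 16384 * (2 * 5120) * 2 / 2**20   # T=16384, D=5120 (the x_clamp failure above)
    320.0
    >>> 2048 * (2 * 7168) * 2 / 2**20    # T=2048, D=7168 temporaries
    56.0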
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.2884365Z 2025-05-07T20:32:36.2884488Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:36.2884707Z 2025-05-07T20:32:36.2884813Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2885264Z self=, 2025-05-07T20:32:36.2885669Z T=1, 2025-05-07T20:32:36.2885852Z D=7168, 2025-05-07T20:32:36.2886051Z scale_ub=1200.0, 2025-05-07T20:32:36.2886278Z contiguous=True, 2025-05-07T20:32:36.2886499Z compiled=False, 2025-05-07T20:32:36.2886710Z ) 2025-05-07T20:32:36.2887029Z self = 2025-05-07T20:32:36.2887509Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.2887779Z 2025-05-07T20:32:36.2887860Z @given( 2025-05-07T20:32:36.2888097Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2888409Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2888764Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2889102Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2889428Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2889727Z ) 2025-05-07T20:32:36.2890078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2890523Z def test_silu_mul_quant( 2025-05-07T20:32:36.2890767Z self, 2025-05-07T20:32:36.2890968Z T: int, 2025-05-07T20:32:36.2891172Z D: int, 2025-05-07T20:32:36.2891392Z scale_ub: Optional[float], 2025-05-07T20:32:36.2891667Z contiguous: bool, 2025-05-07T20:32:36.2891916Z compiled: bool, 2025-05-07T20:32:36.2892138Z ) -> None: 2025-05-07T20:32:36.2892359Z torch.manual_seed(2025) 2025-05-07T20:32:36.2892610Z 2025-05-07T20:32:36.2892880Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2893230Z 2025-05-07T20:32:36.2893436Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2893917Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2894237Z x = x_sign * x_clamp 2025-05-07T20:32:36.2894486Z x0 = x[:, :D] 2025-05-07T20:32:36.2894705Z x1 = x[:, D:] 2025-05-07T20:32:36.2894922Z 2025-05-07T20:32:36.2895120Z if contiguous: 2025-05-07T20:32:36.2895360Z x0 = x0.contiguous() 2025-05-07T20:32:36.2895621Z x1 = x1.contiguous() 2025-05-07T20:32:36.2895867Z 2025-05-07T20:32:36.2896069Z if scale_ub is not None: 2025-05-07T20:32:36.2896344Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.2896687Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.2897004Z ) 2025-05-07T20:32:36.2897204Z else: 2025-05-07T20:32:36.2897427Z scale_ub_tensor = None 2025-05-07T20:32:36.2897685Z 2025-05-07T20:32:36.2897923Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2898519Z op = silu_mul_quant 2025-05-07T20:32:36.2898779Z if compiled: 2025-05-07T20:32:36.2899026Z op = torch.compile(op) 2025-05-07T20:32:36.2899328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2899612Z 2025-05-07T20:32:36.2899805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.2899978Z 2025-05-07T20:32:36.2900083Z moe/activation_test.py:117: 2025-05-07T20:32:36.2900383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2900719Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.2901000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2901761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.2902449Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.2902986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.2903669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.2904333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.2904958Z kernel = self.compile( 2025-05-07T20:32:36.2905498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.2906156Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.2906559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2906790Z 2025-05-07T20:32:36.2907002Z self = 2025-05-07T20:32:36.2908129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.2909483Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d7b22a0>} 2025-05-07T20:32:36.2910815Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.2911822Z context = 2025-05-07T20:32:36.2912106Z 2025-05-07T20:32:36.2912271Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.2912794Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.2913263Z module_map=module_map) 2025-05-07T20:32:36.2913639Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.2914056Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.2914322Z E ^ 2025-05-07T20:32:36.2914784Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.2915227Z 2025-05-07T20:32:36.2915659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.2916205Z 2025-05-07T20:32:36.2916310Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2916723Z self=, 2025-05-07T20:32:36.2917143Z T=128, 2025-05-07T20:32:36.2917338Z D=5120, 2025-05-07T20:32:36.2917537Z scale_ub=None, 2025-05-07T20:32:36.2917752Z contiguous=True, 2025-05-07T20:32:36.2917981Z compiled=False, 2025-05-07T20:32:36.2918191Z ) 2025-05-07T20:32:36.2918513Z self = 2025-05-07T20:32:36.2919007Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.2919276Z 2025-05-07T20:32:36.2919362Z @given( 2025-05-07T20:32:36.2919592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2919917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2920231Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2920567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2920896Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2921188Z ) 2025-05-07T20:32:36.2921542Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2922029Z def test_silu_mul_quant( 2025-05-07T20:32:36.2922277Z self, 2025-05-07T20:32:36.2922478Z T: int, 2025-05-07T20:32:36.2922672Z D: int, 2025-05-07T20:32:36.2922890Z scale_ub: Optional[float], 2025-05-07T20:32:36.2923167Z contiguous: bool, 2025-05-07T20:32:36.2923404Z compiled: bool, 2025-05-07T20:32:36.2923632Z ) -> None: 2025-05-07T20:32:36.2923850Z torch.manual_seed(2025) 2025-05-07T20:32:36.2924090Z 2025-05-07T20:32:36.2924408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2924754Z 2025-05-07T20:32:36.2924945Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2925236Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2925551Z x = x_sign * x_clamp 2025-05-07T20:32:36.2925797Z x0 = x[:, :D] 2025-05-07T20:32:36.2926012Z x1 = x[:, D:] 2025-05-07T20:32:36.2926226Z 2025-05-07T20:32:36.2926421Z if contiguous: 2025-05-07T20:32:36.2926649Z x0 = x0.contiguous() 2025-05-07T20:32:36.2926914Z x1 = x1.contiguous() 2025-05-07T20:32:36.2927156Z 2025-05-07T20:32:36.2927348Z if scale_ub is not None: 2025-05-07T20:32:36.2927626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.2928018Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.2928328Z ) 2025-05-07T20:32:36.2928527Z else: 2025-05-07T20:32:36.2928746Z scale_ub_tensor = None 2025-05-07T20:32:36.2928999Z 2025-05-07T20:32:36.2929235Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2929553Z op = silu_mul_quant 2025-05-07T20:32:36.2929807Z if compiled: 2025-05-07T20:32:36.2930058Z op = torch.compile(op) 2025-05-07T20:32:36.2930361Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2930640Z 2025-05-07T20:32:36.2930833Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.2931011Z 2025-05-07T20:32:36.2931113Z moe/activation_test.py:117: 2025-05-07T20:32:36.2931425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2931759Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.2932053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2932785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.2933464Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.2934076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.2934758Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.2935421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.2935953Z kernel = self.compile( 2025-05-07T20:32:36.2936501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.2937152Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.2937562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2937790Z 2025-05-07T20:32:36.2938000Z self = 2025-05-07T20:32:36.2939067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.2940420Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d7b31a0>} 2025-05-07T20:32:36.2941742Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.2942792Z context = 2025-05-07T20:32:36.2943081Z 2025-05-07T20:32:36.2943251Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.2943769Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.2944277Z module_map=module_map) 2025-05-07T20:32:36.2944638Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.2945000Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.2945288Z E ^ 2025-05-07T20:32:36.2945767Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.2946213Z 2025-05-07T20:32:36.2946623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.4076086Z 2025-05-07T20:32:36.4076291Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4076885Z self=, 2025-05-07T20:32:36.4077636Z T=128, 2025-05-07T20:32:36.4077902Z D=7168, 2025-05-07T20:32:36.4078170Z scale_ub=None, 2025-05-07T20:32:36.4078466Z contiguous=True, 2025-05-07T20:32:36.4078702Z compiled=False, 2025-05-07T20:32:36.4078911Z ) 2025-05-07T20:32:36.4079229Z self = 2025-05-07T20:32:36.4079715Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.4079978Z 2025-05-07T20:32:36.4080064Z @given( 2025-05-07T20:32:36.4080290Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.4080611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.4080925Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.4081253Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.4081587Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.4081881Z ) 2025-05-07T20:32:36.4082270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.4082809Z def test_silu_mul_quant( 2025-05-07T20:32:36.4083053Z self, 2025-05-07T20:32:36.4083254Z T: int, 2025-05-07T20:32:36.4083456Z D: int, 2025-05-07T20:32:36.4083674Z scale_ub: Optional[float], 2025-05-07T20:32:36.4083952Z contiguous: bool, 2025-05-07T20:32:36.4084194Z compiled: bool, 2025-05-07T20:32:36.4084419Z ) -> None: 2025-05-07T20:32:36.4084639Z torch.manual_seed(2025) 2025-05-07T20:32:36.4084886Z 2025-05-07T20:32:36.4085153Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.4085498Z 2025-05-07T20:32:36.4085701Z x_sign = torch.sign(x) 2025-05-07T20:32:36.4085992Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.4086296Z x = x_sign * x_clamp 2025-05-07T20:32:36.4086536Z x0 = x[:, :D] 2025-05-07T20:32:36.4086758Z x1 = x[:, D:] 2025-05-07T20:32:36.4086963Z 2025-05-07T20:32:36.4087158Z if contiguous: 2025-05-07T20:32:36.4087397Z x0 = x0.contiguous() 2025-05-07T20:32:36.4087653Z x1 = x1.contiguous() 2025-05-07T20:32:36.4087897Z 2025-05-07T20:32:36.4088096Z if scale_ub is not None: 2025-05-07T20:32:36.4088366Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.4088701Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.4089015Z ) 2025-05-07T20:32:36.4089209Z else: 2025-05-07T20:32:36.4089426Z scale_ub_tensor = None 2025-05-07T20:32:36.4089686Z 2025-05-07T20:32:36.4089919Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.4090329Z op = silu_mul_quant 2025-05-07T20:32:36.4090586Z if compiled: 2025-05-07T20:32:36.4090836Z op = torch.compile(op) 2025-05-07T20:32:36.4091130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4091409Z 2025-05-07T20:32:36.4091612Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.4091775Z 2025-05-07T20:32:36.4091877Z moe/activation_test.py:117: 2025-05-07T20:32:36.4092176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4092598Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.4092879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4093566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.4094388Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.4094929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.4095603Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.4096269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.4096899Z kernel = self.compile( 2025-05-07T20:32:36.4097440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.4098100Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.4098782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4099012Z 2025-05-07T20:32:36.4099223Z self = 2025-05-07T20:32:36.4100290Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.4101650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779ced4040>} 2025-05-07T20:32:36.4103098Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.4104111Z context = 2025-05-07T20:32:36.4104397Z 2025-05-07T20:32:36.4104574Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.4105089Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.4105610Z module_map=module_map) 2025-05-07T20:32:36.4105980Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.4106333Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.4106596Z E ^ 2025-05-07T20:32:36.4107058Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.4107501Z 2025-05-07T20:32:36.4107928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.4108436Z 2025-05-07T20:32:36.4108539Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4108957Z self=, 2025-05-07T20:32:36.4109359Z T=2048, 2025-05-07T20:32:36.4109548Z D=7168, 2025-05-07T20:32:36.4109748Z scale_ub=1200.0, 2025-05-07T20:32:36.4109977Z contiguous=True, 2025-05-07T20:32:36.4110198Z compiled=False, 2025-05-07T20:32:36.4110408Z ) 2025-05-07T20:32:36.4110730Z self = 2025-05-07T20:32:36.4111325Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.4111597Z 2025-05-07T20:32:36.4111679Z @given( 2025-05-07T20:32:36.4111913Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.4112231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.4112539Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.4112870Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.4113200Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.4113548Z ) 2025-05-07T20:32:36.4113895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.4114336Z def test_silu_mul_quant( 2025-05-07T20:32:36.4114581Z self, 2025-05-07T20:32:36.4114775Z T: int, 2025-05-07T20:32:36.4114980Z D: int, 2025-05-07T20:32:36.4115201Z scale_ub: Optional[float], 2025-05-07T20:32:36.4115467Z contiguous: bool, 2025-05-07T20:32:36.4115743Z compiled: bool, 2025-05-07T20:32:36.4115992Z ) -> None: 2025-05-07T20:32:36.4116205Z torch.manual_seed(2025) 2025-05-07T20:32:36.4116464Z 2025-05-07T20:32:36.4116743Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.4118820Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.4120637Z 2025-05-07T20:32:36.4120757Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.4120977Z 2025-05-07T20:32:36.4121081Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4121492Z self=, 2025-05-07T20:32:36.4121899Z T=1, 2025-05-07T20:32:36.4130649Z D=5120, 2025-05-07T20:32:36.4130887Z scale_ub=1200.0, 2025-05-07T20:32:36.4131125Z contiguous=True, 2025-05-07T20:32:36.4131443Z compiled=False, 2025-05-07T20:32:36.4131661Z ) 2025-05-07T20:32:36.4131980Z self = 2025-05-07T20:32:36.4132479Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.4132750Z 2025-05-07T20:32:36.4132832Z @given( 2025-05-07T20:32:36.4133075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.4133389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.4133789Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.4134124Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.4134450Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.4134743Z ) 2025-05-07T20:32:36.4135097Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.4135537Z def test_silu_mul_quant( 2025-05-07T20:32:36.4135785Z self, 2025-05-07T20:32:36.4135983Z T: int, 2025-05-07T20:32:36.4136180Z D: int, 2025-05-07T20:32:36.4136403Z scale_ub: Optional[float], 2025-05-07T20:32:36.4136678Z contiguous: bool, 2025-05-07T20:32:36.4136923Z compiled: bool, 2025-05-07T20:32:36.4137144Z ) -> None: 2025-05-07T20:32:36.4137366Z torch.manual_seed(2025) 2025-05-07T20:32:36.4137617Z 2025-05-07T20:32:36.4137888Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.4138241Z 2025-05-07T20:32:36.4138444Z x_sign = torch.sign(x) 2025-05-07T20:32:36.4138740Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.4139114Z x = x_sign * x_clamp 2025-05-07T20:32:36.4139357Z x0 = x[:, :D] 2025-05-07T20:32:36.4139574Z x1 = x[:, D:] 2025-05-07T20:32:36.4139793Z 2025-05-07T20:32:36.4139987Z if contiguous: 2025-05-07T20:32:36.4140214Z x0 = x0.contiguous() 2025-05-07T20:32:36.4140482Z x1 = x1.contiguous() 2025-05-07T20:32:36.4140723Z 2025-05-07T20:32:36.4140923Z if scale_ub is not None: 2025-05-07T20:32:36.4141200Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.4141588Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.4141900Z ) 2025-05-07T20:32:36.4142099Z else: 2025-05-07T20:32:36.4142315Z scale_ub_tensor = None 2025-05-07T20:32:36.4142566Z 2025-05-07T20:32:36.4142802Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.4143124Z op = silu_mul_quant 2025-05-07T20:32:36.4143370Z if compiled: 2025-05-07T20:32:36.4143629Z op = torch.compile(op) 2025-05-07T20:32:36.4143929Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4144201Z 2025-05-07T20:32:36.4144403Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.4144565Z 2025-05-07T20:32:36.4144674Z moe/activation_test.py:117: 2025-05-07T20:32:36.4145011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4145345Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.4145630Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4146329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.4147008Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.4147545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.4148221Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.4148879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.4149414Z kernel = self.compile( 2025-05-07T20:32:36.4149958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.4150657Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.4151057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4151298Z 2025-05-07T20:32:36.4151504Z self = 2025-05-07T20:32:36.4152575Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.4153929Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779ced5580>} 2025-05-07T20:32:36.4155259Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.4156270Z context = 2025-05-07T20:32:36.4156563Z 2025-05-07T20:32:36.4156731Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.4157247Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.4157708Z module_map=module_map) 2025-05-07T20:32:36.4158073Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.4158428Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.4158693Z E ^ 2025-05-07T20:32:36.4159191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.4159638Z 2025-05-07T20:32:36.4160047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.4976696Z 2025-05-07T20:32:36.4976966Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4977587Z self=, 2025-05-07T20:32:36.4978385Z T=2048, 2025-05-07T20:32:36.4978649Z D=5120, 2025-05-07T20:32:36.4978913Z scale_ub=None, 2025-05-07T20:32:36.4979210Z contiguous=True, 2025-05-07T20:32:36.4979528Z compiled=False, 2025-05-07T20:32:36.4979744Z ) 2025-05-07T20:32:36.4980075Z self = 2025-05-07T20:32:36.4980579Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.4980852Z 2025-05-07T20:32:36.4980940Z @given( 2025-05-07T20:32:36.4981186Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.4981515Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.4981826Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.4982259Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.4982603Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.4982896Z ) 2025-05-07T20:32:36.4983281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.4983724Z def test_silu_mul_quant( 2025-05-07T20:32:36.4983975Z self, 2025-05-07T20:32:36.4984180Z T: int, 2025-05-07T20:32:36.4984388Z D: int, 2025-05-07T20:32:36.4984607Z scale_ub: Optional[float], 2025-05-07T20:32:36.4984887Z contiguous: bool, 2025-05-07T20:32:36.4985137Z compiled: bool, 2025-05-07T20:32:36.4985365Z ) -> None: 2025-05-07T20:32:36.4985589Z torch.manual_seed(2025) 2025-05-07T20:32:36.4985840Z 2025-05-07T20:32:36.4986113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.4986466Z 2025-05-07T20:32:36.4986667Z > x_sign = torch.sign(x) 2025-05-07T20:32:36.4988662Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
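The CompilationError above ("type fp8e4nv not supported in this architecture") is raised while Triton lowers _fbgemm_silu_mul_quant for this runner's GPU, so every example that reaches the kernel fails the same way regardless of parameters. A hedged sketch of a capability guard that skips such tests up front; the (8, 9) threshold is an assumption (fp8e4nv is the e4m3 format used on Ada/Hopper-class parts), and Fp8KernelTests is a hypothetical class name for illustration:

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv lowering needs compute capability >= 8.9;
    # the A10G on a linux.g5 runner reports (8, 6) and only offers
    # fp8e4b15/fp8e5, matching the error message in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class Fp8KernelTests(unittest.TestCase):
    ...  # fp8e4nv-dependent cases would live here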
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.4990491Z 2025-05-07T20:32:36.4990612Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:36.4990837Z 2025-05-07T20:32:36.4990948Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4991366Z self=, 2025-05-07T20:32:36.4991774Z T=16384, 2025-05-07T20:32:36.4991968Z D=5120, 2025-05-07T20:32:36.4992168Z scale_ub=None, 2025-05-07T20:32:36.4992394Z contiguous=True, 2025-05-07T20:32:36.4992618Z compiled=False, 2025-05-07T20:32:36.4992837Z ) 2025-05-07T20:32:36.4993159Z self = 2025-05-07T20:32:36.4993648Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.4993930Z 2025-05-07T20:32:36.4994012Z @given( 2025-05-07T20:32:36.4994248Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.4994565Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.4994870Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.4995218Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.4995674Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.4995961Z ) 2025-05-07T20:32:36.4996314Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.4996759Z def test_silu_mul_quant( 2025-05-07T20:32:36.4996999Z self, 2025-05-07T20:32:36.4997205Z T: int, 2025-05-07T20:32:36.4997410Z D: int, 2025-05-07T20:32:36.4997632Z scale_ub: Optional[float], 2025-05-07T20:32:36.4997912Z contiguous: bool, 2025-05-07T20:32:36.4998443Z compiled: bool, 2025-05-07T20:32:36.4998748Z ) -> None: 2025-05-07T20:32:36.4998973Z torch.manual_seed(2025) 2025-05-07T20:32:36.4999225Z 2025-05-07T20:32:36.4999495Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5001554Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5003379Z 2025-05-07T20:32:36.5003499Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5003717Z 2025-05-07T20:32:36.5003821Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5004244Z self=, 2025-05-07T20:32:36.5004642Z T=4096, 2025-05-07T20:32:36.5004837Z D=5120, 2025-05-07T20:32:36.5005037Z scale_ub=None, 2025-05-07T20:32:36.5005254Z contiguous=True, 2025-05-07T20:32:36.5005485Z compiled=False, 2025-05-07T20:32:36.5005699Z ) 2025-05-07T20:32:36.5006019Z self = 2025-05-07T20:32:36.5006506Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.5006781Z 2025-05-07T20:32:36.5006861Z @given( 2025-05-07T20:32:36.5007097Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5007415Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5007815Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5008152Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5008482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5008769Z ) 2025-05-07T20:32:36.5009121Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5009569Z def test_silu_mul_quant( 2025-05-07T20:32:36.5009808Z self, 2025-05-07T20:32:36.5010007Z T: int, 2025-05-07T20:32:36.5010213Z D: int, 2025-05-07T20:32:36.5010431Z scale_ub: Optional[float], 2025-05-07T20:32:36.5010711Z contiguous: bool, 2025-05-07T20:32:36.5010962Z compiled: bool, 2025-05-07T20:32:36.5011191Z ) -> None: 2025-05-07T20:32:36.5011411Z torch.manual_seed(2025) 2025-05-07T20:32:36.5011657Z 2025-05-07T20:32:36.5011930Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5014034Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5015846Z 2025-05-07T20:32:36.5015965Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5016244Z 2025-05-07T20:32:36.5016357Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5016763Z self=, 2025-05-07T20:32:36.5017169Z T=2048, 2025-05-07T20:32:36.5017365Z D=5120, 2025-05-07T20:32:36.5017571Z scale_ub=None, 2025-05-07T20:32:36.5017787Z contiguous=False, 2025-05-07T20:32:36.5018020Z compiled=False, 2025-05-07T20:32:36.5018235Z ) 2025-05-07T20:32:36.5018555Z self = 2025-05-07T20:32:36.5019093Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:36.5019363Z 2025-05-07T20:32:36.5019452Z @given( 2025-05-07T20:32:36.5019680Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5019994Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5020301Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5020630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5020963Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5021252Z ) 2025-05-07T20:32:36.5021601Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5022083Z def test_silu_mul_quant( 2025-05-07T20:32:36.5022328Z self, 2025-05-07T20:32:36.5022531Z T: int, 2025-05-07T20:32:36.5022727Z D: int, 2025-05-07T20:32:36.5022950Z scale_ub: Optional[float], 2025-05-07T20:32:36.5023233Z contiguous: bool, 2025-05-07T20:32:36.5023472Z compiled: bool, 2025-05-07T20:32:36.5023702Z ) -> None: 2025-05-07T20:32:36.5023920Z torch.manual_seed(2025) 2025-05-07T20:32:36.5024160Z 2025-05-07T20:32:36.5024437Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5026472Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5028283Z 2025-05-07T20:32:36.5028402Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5028619Z 2025-05-07T20:32:36.5028730Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5029133Z self=, 2025-05-07T20:32:36.5029536Z T=4096, 2025-05-07T20:32:36.5029729Z D=7168, 2025-05-07T20:32:36.5029919Z scale_ub=None, 2025-05-07T20:32:36.5030140Z contiguous=True, 2025-05-07T20:32:36.5030367Z compiled=True, 2025-05-07T20:32:36.5030572Z ) 2025-05-07T20:32:36.5030890Z self = 2025-05-07T20:32:36.5031381Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:36.5031652Z 2025-05-07T20:32:36.5031741Z @given( 2025-05-07T20:32:36.5031975Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5032301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5032613Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5032941Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5033276Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5033569Z ) 2025-05-07T20:32:36.5033913Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5034357Z def test_silu_mul_quant( 2025-05-07T20:32:36.5034603Z self, 2025-05-07T20:32:36.5034795Z T: int, 2025-05-07T20:32:36.5034996Z D: int, 2025-05-07T20:32:36.5035272Z scale_ub: Optional[float], 2025-05-07T20:32:36.5035544Z contiguous: bool, 2025-05-07T20:32:36.5035786Z compiled: bool, 2025-05-07T20:32:36.5036014Z ) -> None: 2025-05-07T20:32:36.5036233Z torch.manual_seed(2025) 2025-05-07T20:32:36.5036476Z 2025-05-07T20:32:36.5036756Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5038752Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5040601Z 2025-05-07T20:32:36.5040726Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5040936Z 2025-05-07T20:32:36.5041041Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5041451Z self=, 2025-05-07T20:32:36.5041892Z T=2048, 2025-05-07T20:32:36.5042089Z D=5120, 2025-05-07T20:32:36.5042290Z scale_ub=1200.0, 2025-05-07T20:32:36.5042518Z contiguous=False, 2025-05-07T20:32:36.5042746Z compiled=False, 2025-05-07T20:32:36.5583284Z ) 2025-05-07T20:32:36.5583803Z self = 2025-05-07T20:32:36.5584490Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:36.5584857Z 2025-05-07T20:32:36.5584977Z @given( 2025-05-07T20:32:36.5585286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5585629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5585961Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5586287Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5586616Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5586903Z ) 2025-05-07T20:32:36.5587253Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5587872Z def test_silu_mul_quant( 2025-05-07T20:32:36.5588120Z self, 2025-05-07T20:32:36.5588314Z T: int, 2025-05-07T20:32:36.5588516Z D: int, 2025-05-07T20:32:36.5588742Z scale_ub: Optional[float], 2025-05-07T20:32:36.5589011Z contiguous: bool, 2025-05-07T20:32:36.5589258Z compiled: bool, 2025-05-07T20:32:36.5589491Z ) -> None: 2025-05-07T20:32:36.5589713Z torch.manual_seed(2025) 2025-05-07T20:32:36.5589955Z 2025-05-07T20:32:36.5590230Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5592231Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
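Note the pattern across the OOM reports above: each new example fails to allocate only tens of MiB while PyTorch already holds about 21.7 GiB, which suggests allocations from earlier Hypothesis examples are still cached when the next one starts. A speculative sketch of a cleanup helper to run between examples; whether it recovers the memory depends on the earlier tensors actually being unreferenced by then:

import gc
import torch

def free_cuda_cache() -> None:
    gc.collect()              # drop lingering Python references first
    torch.cuda.synchronize()  # let in-flight kernels finish
    torch.cuda.empty_cache()  # return cached, unused blocks to the driver

# e.g. call at the top of the test body, before the large torch.randn:
free_cuda_cache()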
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5594044Z 2025-05-07T20:32:36.5594170Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5594381Z 2025-05-07T20:32:36.5594485Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5594895Z self=, 2025-05-07T20:32:36.5595315Z T=4096, 2025-05-07T20:32:36.5595542Z D=7168, 2025-05-07T20:32:36.5595744Z scale_ub=1200.0, 2025-05-07T20:32:36.5595973Z contiguous=True, 2025-05-07T20:32:36.5596306Z compiled=False, 2025-05-07T20:32:36.5596517Z ) 2025-05-07T20:32:36.5596832Z self = 2025-05-07T20:32:36.5597323Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.5597610Z 2025-05-07T20:32:36.5597689Z @given( 2025-05-07T20:32:36.5597928Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5598538Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5598926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5599258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5599582Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5599872Z ) 2025-05-07T20:32:36.5600221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5600658Z def test_silu_mul_quant( 2025-05-07T20:32:36.5600903Z self, 2025-05-07T20:32:36.5601110Z T: int, 2025-05-07T20:32:36.5601306Z D: int, 2025-05-07T20:32:36.5601528Z scale_ub: Optional[float], 2025-05-07T20:32:36.5601803Z contiguous: bool, 2025-05-07T20:32:36.5602046Z compiled: bool, 2025-05-07T20:32:36.5602269Z ) -> None: 2025-05-07T20:32:36.5602564Z torch.manual_seed(2025) 2025-05-07T20:32:36.5602812Z 2025-05-07T20:32:36.5603080Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5605075Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5606888Z 2025-05-07T20:32:36.5607005Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5607218Z 2025-05-07T20:32:36.5607327Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5607741Z self=, 2025-05-07T20:32:36.5608197Z T=16384, 2025-05-07T20:32:36.5608398Z D=7168, 2025-05-07T20:32:36.5608597Z scale_ub=None, 2025-05-07T20:32:36.5608813Z contiguous=False, 2025-05-07T20:32:36.5609041Z compiled=True, 2025-05-07T20:32:36.5609247Z ) 2025-05-07T20:32:36.5609559Z self = 2025-05-07T20:32:36.5610054Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:36.5610328Z 2025-05-07T20:32:36.5610417Z @given( 2025-05-07T20:32:36.5610645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5610965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5611278Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5611616Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5611944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5612240Z ) 2025-05-07T20:32:36.5612596Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5613035Z def test_silu_mul_quant( 2025-05-07T20:32:36.5613284Z self, 2025-05-07T20:32:36.5613488Z T: int, 2025-05-07T20:32:36.5613766Z D: int, 2025-05-07T20:32:36.5613993Z scale_ub: Optional[float], 2025-05-07T20:32:36.5614270Z contiguous: bool, 2025-05-07T20:32:36.5614509Z compiled: bool, 2025-05-07T20:32:36.5614736Z ) -> None: 2025-05-07T20:32:36.5614955Z torch.manual_seed(2025) 2025-05-07T20:32:36.5615197Z 2025-05-07T20:32:36.5615473Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5617545Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5619391Z 2025-05-07T20:32:36.5619512Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5619726Z 2025-05-07T20:32:36.5619843Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5620249Z self=, 2025-05-07T20:32:36.5620654Z T=4096, 2025-05-07T20:32:36.5620848Z D=7168, 2025-05-07T20:32:36.5621042Z scale_ub=None, 2025-05-07T20:32:36.5621260Z contiguous=True, 2025-05-07T20:32:36.5621489Z compiled=False, 2025-05-07T20:32:36.5621693Z ) 2025-05-07T20:32:36.5622019Z self = 2025-05-07T20:32:36.5622580Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.5622854Z 2025-05-07T20:32:36.5622942Z @given( 2025-05-07T20:32:36.5623170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5623495Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5623808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5624137Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5624471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5624764Z ) 2025-05-07T20:32:36.5625112Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5625563Z def test_silu_mul_quant( 2025-05-07T20:32:36.5625815Z self, 2025-05-07T20:32:36.5626013Z T: int, 2025-05-07T20:32:36.5626220Z D: int, 2025-05-07T20:32:36.5626446Z scale_ub: Optional[float], 2025-05-07T20:32:36.5626725Z contiguous: bool, 2025-05-07T20:32:36.5626971Z compiled: bool, 2025-05-07T20:32:36.5627219Z ) -> None: 2025-05-07T20:32:36.5627490Z torch.manual_seed(2025) 2025-05-07T20:32:36.5627736Z 2025-05-07T20:32:36.5628018Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5630016Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5631828Z 2025-05-07T20:32:36.5631957Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5632169Z 2025-05-07T20:32:36.5632285Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5632696Z self=, 2025-05-07T20:32:36.5633103Z T=16384, 2025-05-07T20:32:36.5633303Z D=7168, 2025-05-07T20:32:36.5633498Z scale_ub=None, 2025-05-07T20:32:36.5641906Z contiguous=True, 2025-05-07T20:32:36.5642153Z compiled=False, 2025-05-07T20:32:36.5642370Z ) 2025-05-07T20:32:36.5642698Z self = 2025-05-07T20:32:36.5643194Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.5643481Z 2025-05-07T20:32:36.5643562Z @given( 2025-05-07T20:32:36.5643885Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5644204Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5644510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5644844Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5645182Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5645470Z ) 2025-05-07T20:32:36.5645825Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5646270Z def test_silu_mul_quant( 2025-05-07T20:32:36.5646561Z self, 2025-05-07T20:32:36.5646766Z T: int, 2025-05-07T20:32:36.5646969Z D: int, 2025-05-07T20:32:36.5647186Z scale_ub: Optional[float], 2025-05-07T20:32:36.5647471Z contiguous: bool, 2025-05-07T20:32:36.5647721Z compiled: bool, 2025-05-07T20:32:36.5647952Z ) -> None: 2025-05-07T20:32:36.5648166Z torch.manual_seed(2025) 2025-05-07T20:32:36.5648410Z 2025-05-07T20:32:36.5648688Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5650743Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5652570Z 2025-05-07T20:32:36.5652690Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5652908Z 2025-05-07T20:32:36.5653014Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5653426Z self=, 2025-05-07T20:32:36.5653963Z T=16384, 2025-05-07T20:32:36.5654157Z D=7168, 2025-05-07T20:32:36.5654356Z scale_ub=1200.0, 2025-05-07T20:32:36.5654588Z contiguous=True, 2025-05-07T20:32:36.5654809Z compiled=False, 2025-05-07T20:32:36.5655020Z ) 2025-05-07T20:32:36.5655344Z self = 2025-05-07T20:32:36.5655879Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.5656153Z 2025-05-07T20:32:36.5656234Z @given( 2025-05-07T20:32:36.5656474Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5656788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5657095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5657421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5657755Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5658043Z ) 2025-05-07T20:32:36.5658389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5658836Z def test_silu_mul_quant( 2025-05-07T20:32:36.5659083Z self, 2025-05-07T20:32:36.5659279Z T: int, 2025-05-07T20:32:36.5659479Z D: int, 2025-05-07T20:32:36.5659700Z scale_ub: Optional[float], 2025-05-07T20:32:36.5659974Z contiguous: bool, 2025-05-07T20:32:36.5660221Z compiled: bool, 2025-05-07T20:32:36.5660447Z ) -> None: 2025-05-07T20:32:36.5660662Z torch.manual_seed(2025) 2025-05-07T20:32:36.5660905Z 2025-05-07T20:32:36.5661181Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5663174Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5665027Z 2025-05-07T20:32:36.5665155Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.7464000Z 2025-05-07T20:32:36.7464449Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7464899Z self=, 2025-05-07T20:32:36.7465904Z T=128, 2025-05-07T20:32:36.7466279Z D=5120, 2025-05-07T20:32:36.7466663Z scale_ub=1200.0, 2025-05-07T20:32:36.7467094Z contiguous=False, 2025-05-07T20:32:36.7467536Z compiled=False, 2025-05-07T20:32:36.7467940Z ) 2025-05-07T20:32:36.7468555Z self = 2025-05-07T20:32:36.7469531Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:36.7470089Z 2025-05-07T20:32:36.7470247Z @given( 2025-05-07T20:32:36.7470698Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7471300Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7471899Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7472676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7473312Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7473873Z ) 2025-05-07T20:32:36.7474560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7475279Z def test_silu_mul_quant( 2025-05-07T20:32:36.7475559Z self, 2025-05-07T20:32:36.7475759Z T: int, 2025-05-07T20:32:36.7475956Z D: int, 2025-05-07T20:32:36.7476173Z scale_ub: Optional[float], 2025-05-07T20:32:36.7476447Z contiguous: bool, 2025-05-07T20:32:36.7476686Z compiled: bool, 2025-05-07T20:32:36.7476907Z ) -> None: 2025-05-07T20:32:36.7477132Z torch.manual_seed(2025) 2025-05-07T20:32:36.7477376Z 2025-05-07T20:32:36.7477641Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7477983Z 2025-05-07T20:32:36.7478185Z x_sign = torch.sign(x) 2025-05-07T20:32:36.7478478Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.7478865Z x = x_sign * x_clamp 2025-05-07T20:32:36.7479110Z x0 = x[:, :D] 2025-05-07T20:32:36.7479318Z x1 = x[:, D:] 2025-05-07T20:32:36.7479530Z 2025-05-07T20:32:36.7479718Z if contiguous: 2025-05-07T20:32:36.7479943Z x0 = x0.contiguous() 2025-05-07T20:32:36.7480200Z x1 = x1.contiguous() 2025-05-07T20:32:36.7480440Z 2025-05-07T20:32:36.7480626Z if scale_ub is not None: 2025-05-07T20:32:36.7480899Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.7481231Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.7481538Z ) 2025-05-07T20:32:36.7481732Z else: 2025-05-07T20:32:36.7481945Z scale_ub_tensor = None 2025-05-07T20:32:36.7482193Z 2025-05-07T20:32:36.7482417Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.7482731Z op = silu_mul_quant 2025-05-07T20:32:36.7482982Z if compiled: 2025-05-07T20:32:36.7483229Z op = torch.compile(op) 2025-05-07T20:32:36.7483526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7483831Z 2025-05-07T20:32:36.7484028Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.7484189Z 2025-05-07T20:32:36.7484297Z moe/activation_test.py:117: 2025-05-07T20:32:36.7484584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7484919Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.7485202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7485886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.7486642Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.7487175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.7487850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.7488506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.7489038Z kernel = self.compile( 2025-05-07T20:32:36.7489631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.7490280Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.7490667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7490898Z 2025-05-07T20:32:36.7491102Z self = 2025-05-07T20:32:36.7492168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.7493561Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779cfc11c0>} 2025-05-07T20:32:36.7494966Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.7495976Z context = 2025-05-07T20:32:36.7496264Z 2025-05-07T20:32:36.7496427Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.7496939Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.7497402Z module_map=module_map) 2025-05-07T20:32:36.7497772Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.7498124Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.7498543Z E ^ 2025-05-07T20:32:36.7499079Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.7499525Z 2025-05-07T20:32:36.7499939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.7500442Z 2025-05-07T20:32:36.7500549Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7500954Z self=, 2025-05-07T20:32:36.7501353Z T=2048, 2025-05-07T20:32:36.7501541Z D=7168, 2025-05-07T20:32:36.7501725Z scale_ub=None, 2025-05-07T20:32:36.7501944Z contiguous=False, 2025-05-07T20:32:36.7502167Z compiled=False, 2025-05-07T20:32:36.7502362Z ) 2025-05-07T20:32:36.7502673Z self = 2025-05-07T20:32:36.7503161Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:36.7503428Z 2025-05-07T20:32:36.7503517Z @given( 2025-05-07T20:32:36.7503737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7504046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7504352Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7504672Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7504999Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7505291Z ) 2025-05-07T20:32:36.7505629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7506069Z def test_silu_mul_quant( 2025-05-07T20:32:36.7506379Z self, 2025-05-07T20:32:36.7506574Z T: int, 2025-05-07T20:32:36.7506765Z D: int, 2025-05-07T20:32:36.7506983Z scale_ub: Optional[float], 2025-05-07T20:32:36.7507252Z contiguous: bool, 2025-05-07T20:32:36.7507480Z compiled: bool, 2025-05-07T20:32:36.7507702Z ) -> None: 2025-05-07T20:32:36.7507914Z torch.manual_seed(2025) 2025-05-07T20:32:36.7508153Z 2025-05-07T20:32:36.7508421Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7510520Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.7512327Z 2025-05-07T20:32:36.7512449Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.7512657Z 2025-05-07T20:32:36.7512765Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7513221Z self=, 2025-05-07T20:32:36.7513615Z T=128, 2025-05-07T20:32:36.7513803Z D=7168, 2025-05-07T20:32:36.7513991Z scale_ub=1200.0, 2025-05-07T20:32:36.7514215Z contiguous=True, 2025-05-07T20:32:36.7514433Z compiled=True, 2025-05-07T20:32:36.7514630Z ) 2025-05-07T20:32:36.7514952Z self = 2025-05-07T20:32:36.7515432Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:36.7515701Z 2025-05-07T20:32:36.7515774Z @given( 2025-05-07T20:32:36.7516004Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7516313Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7516615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7516945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7517266Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7517553Z ) 2025-05-07T20:32:36.7517939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7518380Z def test_silu_mul_quant( 2025-05-07T20:32:36.7518619Z self, 2025-05-07T20:32:36.7518816Z T: int, 2025-05-07T20:32:36.7519013Z D: int, 2025-05-07T20:32:36.7519225Z scale_ub: Optional[float], 2025-05-07T20:32:36.7519495Z contiguous: bool, 2025-05-07T20:32:36.7519739Z compiled: bool, 2025-05-07T20:32:36.7519954Z ) -> None: 2025-05-07T20:32:36.7520166Z torch.manual_seed(2025) 2025-05-07T20:32:36.7520405Z 2025-05-07T20:32:36.7520660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7521000Z 2025-05-07T20:32:36.7521190Z x_sign = torch.sign(x) 2025-05-07T20:32:36.7521470Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.7521777Z x = x_sign * x_clamp 2025-05-07T20:32:36.7522015Z x0 = x[:, :D] 2025-05-07T20:32:36.7522224Z x1 = x[:, D:] 2025-05-07T20:32:36.7522435Z 2025-05-07T20:32:36.7522618Z if contiguous: 2025-05-07T20:32:36.7522844Z x0 = x0.contiguous() 2025-05-07T20:32:36.7523099Z x1 = x1.contiguous() 2025-05-07T20:32:36.7523339Z 2025-05-07T20:32:36.7523528Z if scale_ub is not None: 2025-05-07T20:32:36.7523794Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.7524124Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.7524432Z ) 2025-05-07T20:32:36.7524622Z else: 2025-05-07T20:32:36.7524828Z scale_ub_tensor = None 2025-05-07T20:32:36.7525126Z 2025-05-07T20:32:36.7525375Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.7525712Z op = silu_mul_quant 2025-05-07T20:32:36.7525958Z if compiled: 2025-05-07T20:32:36.7526198Z op = torch.compile(op) 2025-05-07T20:32:36.7526498Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7526773Z 2025-05-07T20:32:36.7526961Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.7527128Z 2025-05-07T20:32:36.7527225Z moe/activation_test.py:117: 2025-05-07T20:32:36.7527561Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7527890Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.7528165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7528717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:36.7529274Z return fn(*args, **kwargs) 
2025-05-07T20:32:36.7529917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.7530595Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.7531169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.7531842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.7532496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.7533024Z kernel = self.compile( 2025-05-07T20:32:36.7533556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.7534283Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.7534670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7534902Z 2025-05-07T20:32:36.7535106Z self = 2025-05-07T20:32:36.7536219Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.7537604Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d0abb00>} 2025-05-07T20:32:36.7538916Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.7539921Z context = 2025-05-07T20:32:36.7540206Z 2025-05-07T20:32:36.7540367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.7540881Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.7541336Z module_map=module_map) 2025-05-07T20:32:36.7541699Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.7542044Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.7542299Z E ^ 2025-05-07T20:32:36.7542750Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.7543192Z 2025-05-07T20:32:36.7543599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0313514Z 2025-05-07T20:32:37.0313803Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0314251Z self=, 2025-05-07T20:32:37.0314725Z T=128, 2025-05-07T20:32:37.0315053Z D=7168, 2025-05-07T20:32:37.0315254Z scale_ub=1200.0, 2025-05-07T20:32:37.0315498Z contiguous=True, 2025-05-07T20:32:37.0315735Z compiled=False, 2025-05-07T20:32:37.0315947Z ) 2025-05-07T20:32:37.0316280Z self = 2025-05-07T20:32:37.0316793Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.0317065Z 2025-05-07T20:32:37.0317158Z @given( 2025-05-07T20:32:37.0317395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0317788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0318107Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0318443Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0318782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0319085Z ) 2025-05-07T20:32:37.0319436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0319886Z def test_silu_mul_quant( 2025-05-07T20:32:37.0320159Z self, 2025-05-07T20:32:37.0320367Z T: int, 2025-05-07T20:32:37.0320570Z D: int, 2025-05-07T20:32:37.0320795Z scale_ub: Optional[float], 2025-05-07T20:32:37.0321074Z contiguous: bool, 2025-05-07T20:32:37.0321437Z compiled: bool, 2025-05-07T20:32:37.0321678Z ) -> None: 2025-05-07T20:32:37.0321905Z torch.manual_seed(2025) 2025-05-07T20:32:37.0322150Z 2025-05-07T20:32:37.0322430Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0322787Z 2025-05-07T20:32:37.0322992Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0323286Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0325268Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0327091Z 2025-05-07T20:32:37.0327276Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.0327492Z 2025-05-07T20:32:37.0327604Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0328018Z self=, 2025-05-07T20:32:37.0328427Z T=128, 2025-05-07T20:32:37.0328623Z D=5120, 2025-05-07T20:32:37.0328826Z scale_ub=1200.0, 2025-05-07T20:32:37.0329052Z contiguous=True, 2025-05-07T20:32:37.0329288Z compiled=True, 2025-05-07T20:32:37.0329500Z ) 2025-05-07T20:32:37.0329820Z self = 2025-05-07T20:32:37.0330321Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.0330590Z 2025-05-07T20:32:37.0330678Z @given( 2025-05-07T20:32:37.0330914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0331241Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0331559Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0331892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0332228Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0332525Z ) 2025-05-07T20:32:37.0332881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0333323Z def test_silu_mul_quant( 2025-05-07T20:32:37.0333576Z self, 2025-05-07T20:32:37.0333860Z T: int, 2025-05-07T20:32:37.0334059Z D: int, 2025-05-07T20:32:37.0334283Z scale_ub: Optional[float], 2025-05-07T20:32:37.0334565Z contiguous: bool, 2025-05-07T20:32:37.0334855Z compiled: bool, 2025-05-07T20:32:37.0335086Z ) -> None: 2025-05-07T20:32:37.0335312Z torch.manual_seed(2025) 2025-05-07T20:32:37.0335557Z 2025-05-07T20:32:37.0335835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0336183Z 2025-05-07T20:32:37.0336382Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0336680Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0338630Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0340646Z 2025-05-07T20:32:37.0340778Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.0340994Z 2025-05-07T20:32:37.0341110Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0341576Z self=, 2025-05-07T20:32:37.0341996Z T=128, 2025-05-07T20:32:37.0342198Z D=7168, 2025-05-07T20:32:37.0342396Z scale_ub=None, 2025-05-07T20:32:37.0342615Z contiguous=True, 2025-05-07T20:32:37.0342852Z compiled=True, 2025-05-07T20:32:37.0343061Z ) 2025-05-07T20:32:37.0343493Z self = 2025-05-07T20:32:37.0344016Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.0344401Z 2025-05-07T20:32:37.0344547Z @given( 2025-05-07T20:32:37.0344866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0345306Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0345741Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0346189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0346644Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0347038Z ) 2025-05-07T20:32:37.0347582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0348217Z def test_silu_mul_quant( 2025-05-07T20:32:37.0348559Z self, 2025-05-07T20:32:37.0348835Z T: int, 2025-05-07T20:32:37.0349115Z D: int, 2025-05-07T20:32:37.0349418Z scale_ub: Optional[float], 2025-05-07T20:32:37.0349819Z contiguous: bool, 2025-05-07T20:32:37.0350141Z compiled: bool, 2025-05-07T20:32:37.0350462Z ) -> None: 2025-05-07T20:32:37.0350758Z torch.manual_seed(2025) 2025-05-07T20:32:37.0351087Z 2025-05-07T20:32:37.0351461Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0366960Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0369558Z 2025-05-07T20:32:37.0369730Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.0370026Z 2025-05-07T20:32:37.0370544Z FAILED 2025-05-07T20:32:37.0370693Z 2025-05-07T20:32:37.0370882Z =================================== FAILURES =================================== 2025-05-07T20:32:37.0371467Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:37.0372191Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:37.0373026Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:37.0373884Z | yield 2025-05-07T20:32:37.0374483Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:37.0375200Z | self._callTestMethod(testMethod) 2025-05-07T20:32:37.0375648Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:37.0376433Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:37.0377194Z | if method() is not None: 2025-05-07T20:32:37.0377541Z | ~~~~~~^^ 2025-05-07T20:32:37.0378393Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:37.0379376Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0379788Z | ^^^^^^^ 2025-05-07T20:32:37.0380536Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:37.0381428Z | raise the_error_hypothesis_found 2025-05-07T20:32:37.0382053Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:37.0382622Z +-+---------------- 1 ---------------- 2025-05-07T20:32:37.0383013Z | Traceback (most recent call last): 2025-05-07T20:32:37.0383975Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:37.0385049Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0389271Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0391307Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.0391840Z | self=, 2025-05-07T20:32:37.0392270Z | T=2048, 2025-05-07T20:32:37.0392506Z | D=5120, # or any other generated value 2025-05-07T20:32:37.0392845Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:37.0393204Z | contiguous=True, # or any other generated value 2025-05-07T20:32:37.0393571Z | compiled=False, # or any other generated value 2025-05-07T20:32:37.0393894Z | ) 2025-05-07T20:32:37.0394067Z | 2025-05-07T20:32:37.0394679Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:37.0395290Z +---------------- 2 ---------------- 2025-05-07T20:32:37.0395584Z | Traceback (most recent call last): 2025-05-07T20:32:37.0396274Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:37.0397038Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0399221Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0401241Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.0401674Z | self=, 2025-05-07T20:32:37.0402072Z | T=128, 2025-05-07T20:32:37.0402275Z | D=7168, 2025-05-07T20:32:37.0402548Z | scale_ub=None, 2025-05-07T20:32:37.0402779Z | contiguous=True, 2025-05-07T20:32:37.0403020Z | compiled=True, 2025-05-07T20:32:37.0403245Z | ) 2025-05-07T20:32:37.0403421Z | 2025-05-07T20:32:37.0403933Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:37.0404521Z +---------------- 3 ---------------- 2025-05-07T20:32:37.0404806Z | Traceback (most recent call last): 2025-05-07T20:32:37.0405550Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:37.0406380Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0408358Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
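Each numbered sub-exception above ends with a replay hint of the form "temporarily adding @reproduce_failure('6.131.14', b'...')". A sketch of wiring the first blob into a copy of the test's decorator stack, with the strategies copied verbatim from the log; the blob only decodes against these exact, unchanged strategies and the same Hypothesis version (6.131.14 here), so in practice the decorator is added to the real test_silu_mul_quant rather than this stand-in:

from typing import Optional
from hypothesis import given, reproduce_failure, strategies as st

@reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # blob from sub-exception 1
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
def test_silu_mul_quant_replay(
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    ...  # body as in the original test above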
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0410264Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.0410697Z | self=, 2025-05-07T20:32:37.0411189Z | T=128, 2025-05-07T20:32:37.0411463Z | D=5120, 2025-05-07T20:32:37.0411741Z | scale_ub=1200.0, 2025-05-07T20:32:37.0412076Z | contiguous=True, 2025-05-07T20:32:37.0412407Z | compiled=True, 2025-05-07T20:32:37.0412784Z | ) 2025-05-07T20:32:37.0413038Z | 2025-05-07T20:32:37.0413859Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:37.0414686Z +---------------- 4 ---------------- 2025-05-07T20:32:37.0415080Z | Traceback (most recent call last): 2025-05-07T20:32:37.0416042Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:37.0417000Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:37.0417397Z | ~~~~~~^^ 2025-05-07T20:32:37.0418268Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:37.0419215Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.0420347Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:37.0421424Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.0421818Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:32:37.0422168Z | a, 2025-05-07T20:32:37.0422443Z | ^^ 2025-05-07T20:32:37.0422726Z | ...<23 lines>... 
2025-05-07T20:32:37.0423058Z | USE_INT64=use_int64, 2025-05-07T20:32:37.0423415Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.0423738Z | ) 2025-05-07T20:32:37.0424078Z | ^ 2025-05-07T20:32:37.0424783Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:37.0425773Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.0426397Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.0427273Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:37.0428381Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.0429014Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.0429891Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:37.0430835Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.0431354Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.0432173Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:37.0432999Z | fn() 2025-05-07T20:32:37.0433280Z | ~~^^ 2025-05-07T20:32:37.0434046Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:37.0434923Z | self.fn.run( 2025-05-07T20:32:37.0435231Z | ~~~~~~~~~~~^ 2025-05-07T20:32:37.0435529Z | *args, 2025-05-07T20:32:37.0435828Z | ^^^^^^ 2025-05-07T20:32:37.0436126Z | **current, 2025-05-07T20:32:37.0436435Z | ^^^^^^^^^^ 2025-05-07T20:32:37.0436738Z | ) 2025-05-07T20:32:37.0436999Z | ^ 2025-05-07T20:32:37.0437669Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:37.0438469Z | kernel = self.compile( 2025-05-07T20:32:37.0438821Z | src, 2025-05-07T20:32:37.0439120Z | target=target, 2025-05-07T20:32:37.0439474Z | options=options.__dict__, 2025-05-07T20:32:37.0439856Z | ) 2025-05-07T20:32:37.0440652Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:37.0441616Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.0442577Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.0443663Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.0444303Z | module_map=module_map) 2025-05-07T20:32:37.0444800Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.0445287Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.0445637Z | ^ 2025-05-07T20:32:37.0446264Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0447039Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.0447592Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:32:37.0448308Z |     self=,
2025-05-07T20:32:37.0448906Z |     T=1,  # or any other generated value
2025-05-07T20:32:37.0449337Z |     D=5120,  # or any other generated value
2025-05-07T20:32:37.0449794Z |     scale_ub=None,  # or any other generated value
2025-05-07T20:32:37.0450274Z |     contiguous=True,  # or any other generated value
2025-05-07T20:32:37.0450780Z |     compiled=True,  # or any other generated value
2025-05-07T20:32:37.0451259Z | )
2025-05-07T20:32:37.0451506Z |
2025-05-07T20:32:37.0452225Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
2025-05-07T20:32:37.0453065Z +------------------------------------
2025-05-07T20:32:37.0453565Z ---------------------------------- Hypothesis ----------------------------------
2025-05-07T20:32:37.0454188Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.0454755Z     self=,
2025-05-07T20:32:37.0455360Z     T=1,
2025-05-07T20:32:37.0455620Z     D=5120,
2025-05-07T20:32:37.0455893Z     scale_ub=None,
2025-05-07T20:32:37.0456172Z     contiguous=True,
2025-05-07T20:32:37.0456463Z     compiled=True,
2025-05-07T20:32:37.0456730Z )
2025-05-07T20:32:37.0457140Z self =
2025-05-07T20:32:37.0457768Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:37.0458103Z
2025-05-07T20:32:37.0458217Z     @given(
2025-05-07T20:32:37.0458515Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:37.0458941Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:37.0459409Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:37.0459855Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:37.0460290Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:37.0460690Z     )
2025-05-07T20:32:37.0461175Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:37.0461751Z     def test_silu_mul_quant(
2025-05-07T20:32:37.0462071Z         self,
2025-05-07T20:32:37.0462342Z         T: int,
2025-05-07T20:32:37.0462607Z         D: int,
2025-05-07T20:32:37.0462908Z         scale_ub: Optional[float],
2025-05-07T20:32:37.0463290Z         contiguous: bool,
2025-05-07T20:32:37.0463616Z         compiled: bool,
2025-05-07T20:32:37.0463914Z     ) -> None:
2025-05-07T20:32:37.0464211Z         torch.manual_seed(2025)
2025-05-07T20:32:37.0464529Z
2025-05-07T20:32:37.0464908Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:37.0465380Z
2025-05-07T20:32:37.0465650Z         x_sign = torch.sign(x)
2025-05-07T20:32:37.0466141Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.0466572Z         x = x_sign * x_clamp
2025-05-07T20:32:37.0466896Z         x0 = x[:, :D]
2025-05-07T20:32:37.0467183Z         x1 = x[:, D:]
2025-05-07T20:32:37.0467472Z
2025-05-07T20:32:37.0467733Z         if contiguous:
2025-05-07T20:32:37.0468047Z             x0 = x0.contiguous()
2025-05-07T20:32:37.0468409Z             x1 = x1.contiguous()
2025-05-07T20:32:37.0468724Z
2025-05-07T20:32:37.0468972Z         if scale_ub is not None:
2025-05-07T20:32:37.0469338Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:37.0469776Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:37.0470184Z             )
2025-05-07T20:32:37.0470447Z         else:
2025-05-07T20:32:37.0470741Z             scale_ub_tensor = None
2025-05-07T20:32:37.0471079Z
2025-05-07T20:32:37.0471402Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.0471844Z             op = silu_mul_quant
2025-05-07T20:32:37.0472189Z             if compiled:
2025-05-07T20:32:37.0472541Z                 op = torch.compile(op)
2025-05-07T20:32:37.0472952Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.0473337Z
2025-05-07T20:32:37.0473599Z         y_fp8, y_scale = fn()
2025-05-07T20:32:37.0473994Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:37.0474412Z
2025-05-07T20:32:37.0474732Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.0475197Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:37.0475604Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:37.0476076Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:37.0476558Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.0476967Z
2025-05-07T20:32:37.0477226Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:37.0477495Z
2025-05-07T20:32:37.0477632Z moe/activation_test.py:126:
2025-05-07T20:32:37.0478040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.0478488Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:37.0478971Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.0480041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:37.0481081Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:37.0481826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.0482781Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.0483706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:37.0484772Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:37.0485737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:37.0486578Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:37.0487376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:37.0488055Z     fn()
2025-05-07T20:32:37.0488722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:37.0489488Z     self.fn.run(
2025-05-07T20:32:37.0490111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.0490808Z     kernel = self.compile(
2025-05-07T20:32:37.0491513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.0492384Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.0492976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.0493287Z
2025-05-07T20:32:37.0493555Z self =
2025-05-07T20:32:37.0495124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.0497006Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78c52aa700>}
2025-05-07T20:32:37.0499057Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:37.0500433Z context =
2025-05-07T20:32:37.0500826Z
2025-05-07T20:32:37.0501041Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.0501770Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.0502396Z                            module_map=module_map)
2025-05-07T20:32:37.0502889Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.0503376Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:37.0503753Z E       ^
2025-05-07T20:32:37.0504373Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.0505133Z
2025-05-07T20:32:37.0505748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.0506450Z
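[Note: the reference path that fails above, triton_quantize_fp8_row, performs row-wise FP8 quantization. The sketch below is a minimal eager-mode illustration of the assumed semantics (per-row scale = max|x| / FP8_MAX, optionally capped by scale_ub, dequantized as y_fp8.to(float32) * y_scale[:, None], matching the test's check). The helper name quantize_fp8_row_eager and the exact zero-row/clamping details are illustrative, not FBGEMM's implementation.]

from typing import Optional, Tuple
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_fp8_row_eager(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute max, computed in fp32 for a stable scale.
    row_max = x.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        # Cap the per-row dynamic range before deriving the scale.
        row_max = torch.clamp(row_max, max=scale_ub.item())
    # Guard all-zero rows, then map each row onto the FP8 representable range.
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return x_fp8, scale  # dequantize: x_fp8.to(torch.float32) * scale[:, None]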
2025-05-07T20:32:37.0506598Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.0507160Z     self=,
2025-05-07T20:32:37.0507694Z     T=2048,
2025-05-07T20:32:37.0507955Z     D=5120,
2025-05-07T20:32:37.0508311Z     scale_ub=1200.0,
2025-05-07T20:32:37.0508603Z     contiguous=True,
2025-05-07T20:32:37.0508916Z     compiled=False,
2025-05-07T20:32:37.0509202Z )
2025-05-07T20:32:37.0509623Z self =
2025-05-07T20:32:37.0510287Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
[... test body identical to the first "Trying example" above; with compiled=False the failure comes from fn() itself rather than the reference path ...]
2025-05-07T20:32:37.0526209Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:37.0526565Z moe/activation_test.py:117:
2025-05-07T20:32:37.0526972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.0527431Z moe/activation_test.py:115: in fn
2025-05-07T20:32:37.0527816Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.0528757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:37.0529750Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:37.0530455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.0531363Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.0532283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.0533027Z     kernel = self.compile(
2025-05-07T20:32:37.0533915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.0534950Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.0535524Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.0535847Z
2025-05-07T20:32:37.0536132Z self =
2025-05-07T20:32:37.0537607Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.0539578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78c5162020>}
2025-05-07T20:32:37.0541334Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:37.0542689Z context =
2025-05-07T20:32:37.0543067Z
2025-05-07T20:32:37.0543275Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.0543935Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.0544574Z                            module_map=module_map)
2025-05-07T20:32:37.0545060Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.0545565Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:37.0545915Z E       ^
2025-05-07T20:32:37.0546588Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.0547178Z
2025-05-07T20:32:37.0547722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.0548411Z
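[Note: Triton's fp8e4nv is the e4m3 format behind torch.float8_e4m3fn, and this ValueError is what Triton raises when asked to emit it on a GPU older than compute capability 8.9 (Ada/Hopper); both the eager path (_fbgemm_silu_mul_quant) and the reference path (_kernel_quantize_fp8_row) hit it on this runner. A hedged sketch of a guard such a test could use; the helper name _supports_fp8e4nv is illustrative, not part of the test file.]

import unittest
import torch

def _supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) codegen is assumed to require SM 8.9 or newer.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: skip the property test on older GPUs instead of failing it.
# @unittest.skipIf(not _supports_fp8e4nv(), "FP8 e4m3 requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...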
2025-05-07T20:32:37.0548553Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[... test body identical to the first "Trying example" above; same CompilationError from the reference path in _kernel_quantize_fp8_row: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]
2025-05-07T20:32:37.0605277Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[... same CompilationError from fn() in _fbgemm_silu_mul_quant ...]
2025-05-07T20:32:37.0635620Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
[... same CompilationError from the reference path in _kernel_quantize_fp8_row ...]
2025-05-07T20:32:37.0673361Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... same CompilationError from fn() in _fbgemm_silu_mul_quant ...]
2025-05-07T20:32:37.0689780Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
[... same CompilationError from fn() in _fbgemm_silu_mul_quant ...]
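[Note: every example Hypothesis tries above is drawn from the same sampled_from grid, 5 x 2 x 2 x 2 x 2 = 80 combinations, and each one fails identically. For comparison, an exhaustive sweep over the same space could be sketched with pytest.mark.parametrize; PARAM_GRID and test_silu_mul_quant_grid are illustrative names, not code from the repository.]

import itertools
import pytest

# The same parameter space as the @given(...) strategies in the test above.
PARAM_GRID = list(itertools.product(
    [1, 128, 2048, 4096, 16384],  # T
    [5120, 7168],                 # D
    [None, 1200.0],               # scale_ub
    [True, False],                # contiguous
    [True, False],                # compiled
))

@pytest.mark.parametrize("T,D,scale_ub,contiguous,compiled", PARAM_GRID)
def test_silu_mul_quant_grid(T, D, scale_ub, contiguous, compiled) -> None:
    ...  # same body as test_silu_mul_quant above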
2025-05-07T20:32:37.0702753Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... same CompilationError from the reference path in _kernel_quantize_fp8_row ...]
2025-05-07T20:32:37.0726293Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
[... same CompilationError from fn() in _fbgemm_silu_mul_quant ...]
2025-05-07T20:32:37.0739191Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... same CompilationError from fn() in _fbgemm_silu_mul_quant ...]
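[Note: as the log's own hint says, any single failure here can be replayed deterministically by pinning Hypothesis to the recorded blob. A sketch for the first falsifying example, with the decorator arguments copied verbatim from the log; the decorator should be removed again after debugging.]

from hypothesis import reproduce_failure

@reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=')
def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
    ...  # unchanged test body; Hypothesis replays exactly the recorded example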
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.0746188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.0746409Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.0746753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.0746846Z kernel = self.compile( 2025-05-07T20:32:37.0747229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.0747402Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.0747532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0747578Z 2025-05-07T20:32:37.0747787Z self = 2025-05-07T20:32:37.0748553Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.0749055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf90a5c0>} 2025-05-07T20:32:37.0749825Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.0750013Z context = 2025-05-07T20:32:37.0750020Z 2025-05-07T20:32:37.0750187Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.0750443Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.0750595Z module_map=module_map) 2025-05-07T20:32:37.0750761Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.0750860Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.0750943Z E ^ 2025-05-07T20:32:37.0751291Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0751299Z 2025-05-07T20:32:37.0751705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0751716Z 2025-05-07T20:32:37.0751819Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0752039Z self=, 2025-05-07T20:32:37.0752126Z T=1, 2025-05-07T20:32:37.0752202Z D=5120, 2025-05-07T20:32:37.0752287Z scale_ub=None, 2025-05-07T20:32:37.0752382Z contiguous=True, 2025-05-07T20:32:37.0752465Z compiled=True, 2025-05-07T20:32:37.0752540Z ) 2025-05-07T20:32:37.0752770Z self = 2025-05-07T20:32:37.0752971Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.0752976Z 2025-05-07T20:32:37.0753060Z @given( 2025-05-07T20:32:37.0753184Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0753284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0753403Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0753520Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0753634Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0753714Z ) 2025-05-07T20:32:37.0753958Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0754055Z def test_silu_mul_quant( 2025-05-07T20:32:37.0754138Z self, 2025-05-07T20:32:37.0754218Z T: int, 2025-05-07T20:32:37.0754296Z D: int, 2025-05-07T20:32:37.0754404Z scale_ub: Optional[float], 2025-05-07T20:32:37.0754494Z contiguous: bool, 2025-05-07T20:32:37.0754590Z compiled: bool, 2025-05-07T20:32:37.0754668Z ) -> None: 2025-05-07T20:32:37.0754762Z torch.manual_seed(2025) 2025-05-07T20:32:37.0754844Z 2025-05-07T20:32:37.0755009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0755084Z 2025-05-07T20:32:37.0755185Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0755308Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0755410Z x = x_sign * x_clamp 2025-05-07T20:32:37.0755511Z x0 = x[:, :D] 2025-05-07T20:32:37.0755604Z x1 = x[:, D:] 2025-05-07T20:32:37.0755733Z 2025-05-07T20:32:37.0755823Z if contiguous: 2025-05-07T20:32:37.0755917Z x0 = x0.contiguous() 2025-05-07T20:32:37.0756013Z x1 = x1.contiguous() 2025-05-07T20:32:37.0756087Z 2025-05-07T20:32:37.0756179Z if scale_ub is not None: 2025-05-07T20:32:37.0756293Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0756430Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0756507Z ) 2025-05-07T20:32:37.0756590Z else: 2025-05-07T20:32:37.0756727Z scale_ub_tensor = None 2025-05-07T20:32:37.0756804Z 2025-05-07T20:32:37.0756938Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0757031Z op = silu_mul_quant 2025-05-07T20:32:37.0757116Z if compiled: 2025-05-07T20:32:37.0757221Z op = torch.compile(op) 2025-05-07T20:32:37.0757326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0757405Z 2025-05-07T20:32:37.0757499Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.0757619Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.0757699Z 2025-05-07T20:32:37.0757833Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0758047Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.0758156Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.0758279Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.0758417Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.0758499Z 2025-05-07T20:32:37.0758599Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:37.0758604Z 2025-05-07T20:32:37.0758706Z moe/activation_test.py:126: 2025-05-07T20:32:37.0758834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0758938Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.0759076Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.0759625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:37.0759725Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.0760130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.0760351Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.0760722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.0760972Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.0761343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.0761513Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.0761850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.0761926Z fn() 2025-05-07T20:32:37.0762332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.0762415Z self.fn.run( 2025-05-07T20:32:37.0762756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.0762848Z kernel = self.compile( 2025-05-07T20:32:37.0763226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.0763402Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.0763529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0763533Z 2025-05-07T20:32:37.0763743Z self = 2025-05-07T20:32:37.0764553Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.0765055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf90b240>} 2025-05-07T20:32:37.0765844Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.0766071Z context = 2025-05-07T20:32:37.0766076Z 2025-05-07T20:32:37.0766242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.0766498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.0766607Z module_map=module_map) 2025-05-07T20:32:37.0766773Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.0766876Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.0766997Z E ^ 2025-05-07T20:32:37.0767350Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)

Each of these examples reprints the identical test body and fails at the same point: the reference path (ref_fn at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]) raises

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
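Root cause of every failure above: Triton's fp8e4nv is float8_e4m3fn, which Triton only lowers on NVIDIA GPUs with compute capability 8.9 or newer (Ada and Hopper parts such as L4 and H100). This job runs on a g5.4xlarge runner, whose A10G is compute capability 8.6, where Triton exposes only fp8e4b15 and fp8e5; both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row therefore fail at kernel-compile time, before any numerics run. A minimal sketch of a capability gate that would let such examples skip cleanly on pre-SM-8.9 runners (the helper and decorator names below are illustrative, not from activation_test.py):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (torch.float8_e4m3fn) only on SM 8.9+;
        # the A10G on g5 instances is SM 8.6, so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator a test like test_silu_mul_quant could carry:
    skip_if_no_fp8e4nv = unittest.skipUnless(
        supports_fp8e4nv(), "Triton fp8e4nv requires compute capability >= 8.9"
    )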
Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

(test body identical to the examples above; this example fails on the compiled fn() path, so the traceback additionally passes through torch._dynamo)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
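For reference, the computation under test can be stated without Triton. The SiLU-mul part is taken directly from the test's ref_fn(); the row-wise fp8 scaling below is a sketch standing in for fbgemm's triton_quantize_fp8_row (the scale_ub clamp and the eps floor are assumptions; torch.finfo(torch.float8_e4m3fn).max is 448.0):

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, exactly as in the test's ref_fn().
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Per-row scale so each row fits into float8_e4m3fn's range;
        # scale_ub, when given, caps the row maximum before scaling.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / torch.finfo(torch.float8_e4m3fn).max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        # The test reconstructs y as y_fp8.to(torch.float32) * scale[:, None].
        return y_fp8, scale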
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0855542Z 2025-05-07T20:32:37.0855998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0856006Z 2025-05-07T20:32:37.0856115Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0856345Z self=, 2025-05-07T20:32:37.0856427Z T=1, 2025-05-07T20:32:37.0856505Z D=5120, 2025-05-07T20:32:37.0856597Z scale_ub=None, 2025-05-07T20:32:37.0856685Z contiguous=False, 2025-05-07T20:32:37.0856769Z compiled=True, 2025-05-07T20:32:37.0856930Z ) 2025-05-07T20:32:37.0857148Z self = 2025-05-07T20:32:37.0857318Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.0857323Z 2025-05-07T20:32:37.0857399Z @given( 2025-05-07T20:32:37.0857523Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0857632Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0857752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0857914Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0858033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0858109Z ) 2025-05-07T20:32:37.0858355Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0858450Z def test_silu_mul_quant( 2025-05-07T20:32:37.0858528Z self, 2025-05-07T20:32:37.0858610Z T: int, 2025-05-07T20:32:37.0858686Z D: int, 2025-05-07T20:32:37.0858789Z scale_ub: Optional[float], 2025-05-07T20:32:37.0858884Z contiguous: bool, 2025-05-07T20:32:37.0858969Z compiled: bool, 2025-05-07T20:32:37.0859049Z ) -> None: 2025-05-07T20:32:37.0859150Z torch.manual_seed(2025) 2025-05-07T20:32:37.0859265Z 2025-05-07T20:32:37.0859438Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0859518Z 2025-05-07T20:32:37.0859610Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0859736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0859833Z x = x_sign * x_clamp 2025-05-07T20:32:37.0859915Z x0 = x[:, :D] 2025-05-07T20:32:37.0859996Z x1 = x[:, D:] 2025-05-07T20:32:37.0860070Z 2025-05-07T20:32:37.0860154Z if contiguous: 2025-05-07T20:32:37.0860251Z x0 = x0.contiguous() 2025-05-07T20:32:37.0860340Z x1 = x1.contiguous() 2025-05-07T20:32:37.0860411Z 2025-05-07T20:32:37.0860509Z if scale_ub is not None: 2025-05-07T20:32:37.0860615Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0860749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0860833Z ) 2025-05-07T20:32:37.0860910Z else: 2025-05-07T20:32:37.0861005Z scale_ub_tensor = None 2025-05-07T20:32:37.0861082Z 2025-05-07T20:32:37.0861282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0861374Z op = silu_mul_quant 2025-05-07T20:32:37.0861468Z if compiled: 2025-05-07T20:32:37.0861570Z op = torch.compile(op) 2025-05-07T20:32:37.0861681Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0861753Z 2025-05-07T20:32:37.0861843Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.0861967Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.0862040Z 2025-05-07T20:32:37.0862176Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0862287Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.0862388Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.0862510Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.0862659Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.0862731Z 2025-05-07T20:32:37.0862838Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:37.0862843Z 2025-05-07T20:32:37.0862942Z moe/activation_test.py:126: 2025-05-07T20:32:37.0863074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0863190Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.0863327Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.0863876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:37.0863983Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.0864387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.0864621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.0864991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.0865248Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.0865662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.0865834Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.0866181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.0866260Z fn() 2025-05-07T20:32:37.0866653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.0866744Z self.fn.run( 2025-05-07T20:32:37.0867080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.0867173Z kernel = self.compile( 2025-05-07T20:32:37.0867598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.0867774Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.0867917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0867921Z 2025-05-07T20:32:37.0868128Z self = 2025-05-07T20:32:37.0868897Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.0869404Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dd02de0>} 2025-05-07T20:32:37.0870178Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.0870376Z context = 2025-05-07T20:32:37.0870382Z 2025-05-07T20:32:37.0870546Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.0870809Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.0870919Z module_map=module_map) 2025-05-07T20:32:37.0871081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.0871194Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.0871270Z E ^ 2025-05-07T20:32:37.0871625Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0871630Z 2025-05-07T20:32:37.0872039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0872044Z 2025-05-07T20:32:37.0872146Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0872374Z self=, 2025-05-07T20:32:37.0872451Z T=1, 2025-05-07T20:32:37.0872529Z D=5120, 2025-05-07T20:32:37.0872618Z scale_ub=None, 2025-05-07T20:32:37.0872702Z contiguous=True, 2025-05-07T20:32:37.0872790Z compiled=False, 2025-05-07T20:32:37.0872864Z ) 2025-05-07T20:32:37.0873080Z self = 2025-05-07T20:32:37.0873293Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.0873297Z 2025-05-07T20:32:37.0873373Z @given( 2025-05-07T20:32:37.0873493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0873596Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0873714Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0873833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0873949Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0874068Z ) 2025-05-07T20:32:37.0874316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0874411Z def test_silu_mul_quant( 2025-05-07T20:32:37.0874486Z self, 2025-05-07T20:32:37.0874566Z T: int, 2025-05-07T20:32:37.0874642Z D: int, 2025-05-07T20:32:37.0874742Z scale_ub: Optional[float], 2025-05-07T20:32:37.0874834Z contiguous: bool, 2025-05-07T20:32:37.0874919Z compiled: bool, 2025-05-07T20:32:37.0875000Z ) -> None: 2025-05-07T20:32:37.0875098Z torch.manual_seed(2025) 2025-05-07T20:32:37.0875169Z 2025-05-07T20:32:37.0875334Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0875409Z 2025-05-07T20:32:37.0875546Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0875678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0875766Z x = x_sign * x_clamp 2025-05-07T20:32:37.0875846Z x0 = x[:, :D] 2025-05-07T20:32:37.0875940Z x1 = x[:, D:] 2025-05-07T20:32:37.0876011Z 2025-05-07T20:32:37.0876094Z if contiguous: 2025-05-07T20:32:37.0876190Z x0 = x0.contiguous() 2025-05-07T20:32:37.0876278Z x1 = x1.contiguous() 2025-05-07T20:32:37.0876350Z 2025-05-07T20:32:37.0876446Z if scale_ub is not None: 2025-05-07T20:32:37.0876552Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0876686Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0876770Z ) 2025-05-07T20:32:37.0876847Z else: 2025-05-07T20:32:37.0876945Z scale_ub_tensor = None 2025-05-07T20:32:37.0877018Z 2025-05-07T20:32:37.0877145Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0877243Z op = silu_mul_quant 2025-05-07T20:32:37.0877374Z if compiled: 2025-05-07T20:32:37.0877476Z op = torch.compile(op) 2025-05-07T20:32:37.0877587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0877663Z 2025-05-07T20:32:37.0877753Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.0877758Z 2025-05-07T20:32:37.0877859Z moe/activation_test.py:117: 2025-05-07T20:32:37.0877988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0878087Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.0878192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0878685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.0878787Z 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f78be546a20>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    (same Triton compile chain as above)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
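Root cause note: the failure is architectural, not shape-dependent. Triton's fp8e4nv is NVIDIA's FP8 E4M3 format, which Triton compiles only for GPUs with compute capability 8.9 or newer (Ada/Hopper); this job runs on a linux.g5.4xlarge runner whose A10G GPU is SM 8.6, where only fp8e4b15 and fp8e5 exist, exactly as the ValueError reports. A minimal capability guard for such tests could look like the sketch below (an illustration under those assumptions, not the decorator the test actually uses; sm89_or_later and requires_fp8 are hypothetical names):

import unittest
import torch

def sm89_or_later() -> bool:
    # FP8 E4M3 ("fp8e4nv" in Triton) needs SM 8.9+; Ada reports (8, 9),
    # Hopper (9, 0), while the A10G on g5 instances reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Apply to FP8 tests so unsupported runners skip instead of erroring out.
requires_fp8 = unittest.skipUnless(
    sm89_or_later(), "Triton fp8e4nv (FP8 E4M3) requires compute capability >= 8.9"
)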
Hypothesis keeps drawing examples; each one re-prints the test source above and fails at `y_fp8, y_scale = fn()` with the identical fp8e4nv CompilationError in _fbgemm_silu_mul_quant (gen_ai/moe/activation.py:80). Only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
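Because the ValueError is raised while Triton lowers the kernel's AST, before any launch, every combination of T, D, scale_ub, contiguous, and compiled reaches the same compile-time failure; the shapes never matter. A deterministic standalone reproducer sketch, assuming the import path shown in the traceback:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

x0 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
x1 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
silu_mul_quant(x0, x1, None)  # raises CompilationError on SM < 8.9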
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

This example gets further: the source listing continues past fn(), and the failure surfaces in the fp32 reference path instead, whose quantization kernel trips over the same unsupported dtype:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
    (make_ir locals as above, here with num_stages=2)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
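For reference, ref_fn computes y = x0 * sigmoid(x0) * x1 in fp32 and then row-quantizes it; triton_quantize_fp8_row compiles the same fp8e4nv cast, so it fails on this GPU for the same reason. The rowwise scheme itself is simple; a pure-PyTorch sketch (assumptions: torch.float8_e4m3fn is available, E4M3's finite max is 448.0, and scale_ub caps the per-row max; quantize_fp8_row_ref is a hypothetical name, not the FBGEMM kernel):

import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Per-row max-abs sets the dequantization scale so each row spans
    # the representable FP8 range; clamp avoids division by zero.
    row_max = y.abs().amax(dim=-1).to(torch.float32).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
    y_scale = row_max / FP8_E4M3_MAX
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale  # dequantize: y_fp8.float() * y_scale[:, None]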
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0982842Z 2025-05-07T20:32:37.0983256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0983260Z 2025-05-07T20:32:37.0983363Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0983589Z self=, 2025-05-07T20:32:37.0983673Z T=1, 2025-05-07T20:32:37.0983750Z D=5120, 2025-05-07T20:32:37.0983835Z scale_ub=1200.0, 2025-05-07T20:32:37.0983968Z contiguous=False, 2025-05-07T20:32:37.0984052Z compiled=True, 2025-05-07T20:32:37.0984126Z ) 2025-05-07T20:32:37.0984346Z self = 2025-05-07T20:32:37.0984509Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.0984514Z 2025-05-07T20:32:37.0984597Z @given( 2025-05-07T20:32:37.0984714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0984817Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0984934Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0985050Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0985163Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0985286Z ) 2025-05-07T20:32:37.0985533Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0985629Z def test_silu_mul_quant( 2025-05-07T20:32:37.0985703Z self, 2025-05-07T20:32:37.0985783Z T: int, 2025-05-07T20:32:37.0985863Z D: int, 2025-05-07T20:32:37.0985962Z scale_ub: Optional[float], 2025-05-07T20:32:37.0986049Z contiguous: bool, 2025-05-07T20:32:37.0986134Z compiled: bool, 2025-05-07T20:32:37.0986212Z ) -> None: 2025-05-07T20:32:37.0986305Z torch.manual_seed(2025) 2025-05-07T20:32:37.0986382Z 2025-05-07T20:32:37.0986549Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0986625Z 2025-05-07T20:32:37.0986719Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0986844Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0986935Z x = x_sign * x_clamp 2025-05-07T20:32:37.0987017Z x0 = x[:, :D] 2025-05-07T20:32:37.0987098Z x1 = x[:, D:] 2025-05-07T20:32:37.0987243Z 2025-05-07T20:32:37.0987328Z if contiguous: 2025-05-07T20:32:37.0987417Z x0 = x0.contiguous() 2025-05-07T20:32:37.0987515Z x1 = x1.contiguous() 2025-05-07T20:32:37.0987595Z 2025-05-07T20:32:37.0987684Z if scale_ub is not None: 2025-05-07T20:32:37.0987793Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0987927Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0988002Z ) 2025-05-07T20:32:37.0988086Z else: 2025-05-07T20:32:37.0988179Z scale_ub_tensor = None 2025-05-07T20:32:37.0988255Z 2025-05-07T20:32:37.0988392Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0988481Z op = silu_mul_quant 2025-05-07T20:32:37.0988571Z if compiled: 2025-05-07T20:32:37.0988670Z op = torch.compile(op) 2025-05-07T20:32:37.0988779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0988855Z 2025-05-07T20:32:37.0988948Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.0988953Z 2025-05-07T20:32:37.0989049Z moe/activation_test.py:117: 2025-05-07T20:32:37.0989187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0989286Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.0989384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0989750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.0989844Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.0990335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.0990477Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.0990836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.0991064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.0991400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.0991538Z kernel = self.compile( 2025-05-07T20:32:37.0991917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.0992095Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.0992225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0992229Z 2025-05-07T20:32:37.0992435Z self = 2025-05-07T20:32:37.0993244Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.0993744Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dc29300>} 2025-05-07T20:32:37.0994481Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.0994667Z context = 2025-05-07T20:32:37.0994672Z 2025-05-07T20:32:37.0994835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.0995099Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.0995204Z module_map=module_map) 2025-05-07T20:32:37.0995394Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.0995507Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.0995643Z E ^ 2025-05-07T20:32:37.0996001Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0996009Z 2025-05-07T20:32:37.0996413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0996418Z 2025-05-07T20:32:37.0996523Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0996744Z self=, 2025-05-07T20:32:37.0996820Z T=1, 2025-05-07T20:32:37.0996901Z D=5120, 2025-05-07T20:32:37.0996988Z scale_ub=1200.0, 2025-05-07T20:32:37.0997074Z contiguous=False, 2025-05-07T20:32:37.0997161Z compiled=False, 2025-05-07T20:32:37.0997234Z ) 2025-05-07T20:32:37.0997449Z self = 2025-05-07T20:32:37.0997619Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.0997630Z 2025-05-07T20:32:37.0997707Z @given( 2025-05-07T20:32:37.0997830Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0997934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0998048Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0998363Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0998530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0998608Z ) 2025-05-07T20:32:37.0998853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0999038Z def test_silu_mul_quant( 2025-05-07T20:32:37.0999116Z self, 2025-05-07T20:32:37.0999199Z T: int, 2025-05-07T20:32:37.0999274Z D: int, 2025-05-07T20:32:37.0999372Z scale_ub: Optional[float], 2025-05-07T20:32:37.0999462Z contiguous: bool, 2025-05-07T20:32:37.0999551Z compiled: bool, 2025-05-07T20:32:37.0999633Z ) -> None: 2025-05-07T20:32:37.0999728Z torch.manual_seed(2025) 2025-05-07T20:32:37.0999800Z 2025-05-07T20:32:37.0999971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1000109Z 2025-05-07T20:32:37.1000200Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1000326Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1000412Z x = x_sign * x_clamp 2025-05-07T20:32:37.1000492Z x0 = x[:, :D] 2025-05-07T20:32:37.1000575Z x1 = x[:, D:] 2025-05-07T20:32:37.1000647Z 2025-05-07T20:32:37.1000729Z if contiguous: 2025-05-07T20:32:37.1000826Z x0 = x0.contiguous() 2025-05-07T20:32:37.1000915Z x1 = x1.contiguous() 2025-05-07T20:32:37.1000987Z 2025-05-07T20:32:37.1001081Z if scale_ub is not None: 2025-05-07T20:32:37.1001185Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1001388Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1001468Z ) 2025-05-07T20:32:37.1001544Z else: 2025-05-07T20:32:37.1001642Z scale_ub_tensor = None 2025-05-07T20:32:37.1001712Z 2025-05-07T20:32:37.1001842Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1001937Z op = silu_mul_quant 2025-05-07T20:32:37.1002024Z if compiled: 2025-05-07T20:32:37.1002123Z op = torch.compile(op) 2025-05-07T20:32:37.1002230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1002300Z 2025-05-07T20:32:37.1002390Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1002398Z 2025-05-07T20:32:37.1002497Z moe/activation_test.py:117: 2025-05-07T20:32:37.1002628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1002728Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1002826Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1003379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1003482Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1003840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1004062Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1004398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1004490Z kernel = self.compile( 2025-05-07T20:32:37.1004870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1005047Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1005176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1005184Z 2025-05-07T20:32:37.1005408Z self = 2025-05-07T20:32:37.1006198Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1006702Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dc2a020>} 2025-05-07T20:32:37.1007434Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1007668Z context = 2025-05-07T20:32:37.1007673Z 2025-05-07T20:32:37.1007837Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1008097Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1008212Z module_map=module_map) 2025-05-07T20:32:37.1008412Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1008509Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1008590Z E ^ 2025-05-07T20:32:37.1008936Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1008941Z 2025-05-07T20:32:37.1009351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1009358Z 2025-05-07T20:32:37.1009458Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1009677Z self=, 2025-05-07T20:32:37.1009759Z T=16384, 2025-05-07T20:32:37.1009873Z D=5120, 2025-05-07T20:32:37.1009957Z scale_ub=1200.0, 2025-05-07T20:32:37.1010052Z contiguous=False, 2025-05-07T20:32:37.1010136Z compiled=True, 2025-05-07T20:32:37.1010214Z ) 2025-05-07T20:32:37.1010433Z self = 2025-05-07T20:32:37.1010609Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.1010613Z 2025-05-07T20:32:37.1010692Z @given( 2025-05-07T20:32:37.1010812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1010911Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1011028Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1011146Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1011264Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1011338Z ) 2025-05-07T20:32:37.1011581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1011676Z def test_silu_mul_quant( 2025-05-07T20:32:37.1011794Z self, 2025-05-07T20:32:37.1011871Z T: int, 2025-05-07T20:32:37.1011951Z D: int, 2025-05-07T20:32:37.1012048Z scale_ub: Optional[float], 2025-05-07T20:32:37.1012142Z contiguous: bool, 2025-05-07T20:32:37.1012232Z compiled: bool, 2025-05-07T20:32:37.1012309Z ) -> None: 2025-05-07T20:32:37.1012402Z torch.manual_seed(2025) 2025-05-07T20:32:37.1012476Z 2025-05-07T20:32:37.1012641Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1012715Z 2025-05-07T20:32:37.1012814Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1012942Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1013034Z x = x_sign * x_clamp 2025-05-07T20:32:37.1013115Z x0 = x[:, :D] 2025-05-07T20:32:37.1013195Z x1 = x[:, D:] 2025-05-07T20:32:37.1013267Z 2025-05-07T20:32:37.1013354Z if contiguous: 2025-05-07T20:32:37.1013447Z x0 = x0.contiguous() 2025-05-07T20:32:37.1013544Z x1 = x1.contiguous() 2025-05-07T20:32:37.1013615Z 2025-05-07T20:32:37.1013810Z if scale_ub is not None: 2025-05-07T20:32:37.1013926Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1014060Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1014134Z ) 2025-05-07T20:32:37.1014213Z else: 2025-05-07T20:32:37.1014305Z scale_ub_tensor = None 2025-05-07T20:32:37.1014382Z 2025-05-07T20:32:37.1014510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1014598Z op = silu_mul_quant 2025-05-07T20:32:37.1014734Z if compiled: 2025-05-07T20:32:37.1014833Z op = torch.compile(op) 2025-05-07T20:32:37.1014940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1015017Z 2025-05-07T20:32:37.1015110Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1015116Z 2025-05-07T20:32:37.1015212Z moe/activation_test.py:117: 2025-05-07T20:32:37.1015346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1015449Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1015611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1015972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1016065Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self =
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
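Note: every repetition of this failure, above and below, is the same compile-time error rather than a numerical mismatch. The job runs on a linux.g5.4xlarge.nvidia.gpu runner (NVIDIA A10G, compute capability 8.6), while Triton's fp8e4nv type (torch.float8_e4m3fn) generally requires compute capability 8.9 or newer, so _fbgemm_silu_mul_quant cannot be lowered on this GPU. Below is a minimal sketch of a device-capability guard that would skip these examples on unsupported hardware; the helper name, the 8.9 threshold, and the test-class name are illustrative assumptions, not FBGEMM code:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Assumed threshold for this sketch: fp8e4nv (e4m3) lowering is
        # available on SM 8.9+ (e.g. L4, L40S, H100); the A10G on this
        # runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class ActivationTests(unittest.TestCase):
        ...  # the Hypothesis-driven test_silu_mul_quant body from this log

With such a guard the run would report skips instead of repeating the identical CompilationError for every sampled (T, D, scale_ub, contiguous, compiled) combination.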
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1035093Z 2025-05-07T20:32:37.1035524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1035528Z 2025-05-07T20:32:37.1035650Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1035879Z self=, 2025-05-07T20:32:37.1035954Z T=1, 2025-05-07T20:32:37.1036072Z D=5120, 2025-05-07T20:32:37.1036154Z scale_ub=None, 2025-05-07T20:32:37.1036241Z contiguous=False, 2025-05-07T20:32:37.1036329Z compiled=False, 2025-05-07T20:32:37.1036402Z ) 2025-05-07T20:32:37.1036617Z self = 2025-05-07T20:32:37.1036784Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.1036789Z 2025-05-07T20:32:37.1036863Z @given( 2025-05-07T20:32:37.1036985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1037085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1037200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1037320Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1037434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1037508Z ) 2025-05-07T20:32:37.1037756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1037849Z def test_silu_mul_quant( 2025-05-07T20:32:37.1037928Z self, 2025-05-07T20:32:37.1038007Z T: int, 2025-05-07T20:32:37.1038085Z D: int, 2025-05-07T20:32:37.1038183Z scale_ub: Optional[float], 2025-05-07T20:32:37.1038274Z contiguous: bool, 2025-05-07T20:32:37.1038361Z compiled: bool, 2025-05-07T20:32:37.1038438Z ) -> None: 2025-05-07T20:32:37.1038536Z torch.manual_seed(2025) 2025-05-07T20:32:37.1038609Z 2025-05-07T20:32:37.1038774Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1038893Z 2025-05-07T20:32:37.1038983Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1039105Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1039196Z x = x_sign * x_clamp 2025-05-07T20:32:37.1039275Z x0 = x[:, :D] 2025-05-07T20:32:37.1039356Z x1 = x[:, D:] 2025-05-07T20:32:37.1039433Z 2025-05-07T20:32:37.1039515Z if contiguous: 2025-05-07T20:32:37.1039605Z x0 = x0.contiguous() 2025-05-07T20:32:37.1039697Z x1 = x1.contiguous() 2025-05-07T20:32:37.1039767Z 2025-05-07T20:32:37.1039901Z if scale_ub is not None: 2025-05-07T20:32:37.1040009Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1040142Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1040221Z ) 2025-05-07T20:32:37.1040295Z else: 2025-05-07T20:32:37.1040387Z scale_ub_tensor = None 2025-05-07T20:32:37.1040461Z 2025-05-07T20:32:37.1040587Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1040679Z op = silu_mul_quant 2025-05-07T20:32:37.1040766Z if compiled: 2025-05-07T20:32:37.1040863Z op = torch.compile(op) 2025-05-07T20:32:37.1040968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1041041Z 2025-05-07T20:32:37.1041173Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1041180Z 2025-05-07T20:32:37.1041281Z moe/activation_test.py:117: 2025-05-07T20:32:37.1041410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1041510Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1041613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1042102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1042199Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1042559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1042780Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1043118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1043213Z kernel = self.compile( 2025-05-07T20:32:37.1043633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1043808Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1043937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1043941Z 2025-05-07T20:32:37.1044143Z self = 2025-05-07T20:32:37.1044907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1045404Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d609120>} 2025-05-07T20:32:37.1046192Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1046381Z context = 2025-05-07T20:32:37.1046385Z 2025-05-07T20:32:37.1046550Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1046806Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1046912Z module_map=module_map) 2025-05-07T20:32:37.1047076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1047213Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1047288Z E ^ 2025-05-07T20:32:37.1047638Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1047645Z 2025-05-07T20:32:37.1048054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1048058Z 2025-05-07T20:32:37.1048165Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1048449Z self=, 2025-05-07T20:32:37.1048526Z T=4096, 2025-05-07T20:32:37.1048607Z D=7168, 2025-05-07T20:32:37.1048690Z scale_ub=1200.0, 2025-05-07T20:32:37.1048779Z contiguous=False, 2025-05-07T20:32:37.1048867Z compiled=False, 2025-05-07T20:32:37.1048939Z ) 2025-05-07T20:32:37.1049156Z self = 2025-05-07T20:32:37.1049331Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.1049335Z 2025-05-07T20:32:37.1049413Z @given( 2025-05-07T20:32:37.1049536Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1049675Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1049792Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1049912Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1050025Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1050105Z ) 2025-05-07T20:32:37.1050346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1050438Z def test_silu_mul_quant( 2025-05-07T20:32:37.1050518Z self, 2025-05-07T20:32:37.1050592Z T: int, 2025-05-07T20:32:37.1050669Z D: int, 2025-05-07T20:32:37.1050767Z scale_ub: Optional[float], 2025-05-07T20:32:37.1050859Z contiguous: bool, 2025-05-07T20:32:37.1050944Z compiled: bool, 2025-05-07T20:32:37.1051027Z ) -> None: 2025-05-07T20:32:37.1051119Z torch.manual_seed(2025) 2025-05-07T20:32:37.1051190Z 2025-05-07T20:32:37.1051364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1051435Z 2025-05-07T20:32:37.1051568Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1051695Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1051781Z x = x_sign * x_clamp 2025-05-07T20:32:37.1051868Z x0 = x[:, :D] 2025-05-07T20:32:37.1051946Z x1 = x[:, D:] 2025-05-07T20:32:37.1052017Z 2025-05-07T20:32:37.1052103Z if contiguous: 2025-05-07T20:32:37.1052192Z x0 = x0.contiguous() 2025-05-07T20:32:37.1052281Z x1 = x1.contiguous() 2025-05-07T20:32:37.1052355Z 2025-05-07T20:32:37.1052445Z if scale_ub is not None: 2025-05-07T20:32:37.1052550Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1052690Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1052764Z ) 2025-05-07T20:32:37.1052838Z else: 2025-05-07T20:32:37.1052934Z scale_ub_tensor = None 2025-05-07T20:32:37.1053005Z 2025-05-07T20:32:37.1053138Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1053229Z op = silu_mul_quant 2025-05-07T20:32:37.1053311Z if compiled: 2025-05-07T20:32:37.1053414Z op = torch.compile(op) 2025-05-07T20:32:37.1053522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1053593Z 2025-05-07T20:32:37.1053779Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1053784Z 2025-05-07T20:32:37.1053878Z moe/activation_test.py:117: 2025-05-07T20:32:37.1054004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1054105Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1054204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1054746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:37.1054841Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1055199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1055421Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1055753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1055884Z kernel = self.compile( 2025-05-07T20:32:37.1056264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1056438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1056567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1056574Z 2025-05-07T20:32:37.1056775Z self = 2025-05-07T20:32:37.1057576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1058073Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d60a480>} 2025-05-07T20:32:37.1058805Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1058995Z context = 2025-05-07T20:32:37.1059002Z 2025-05-07T20:32:37.1059162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1059422Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1059527Z module_map=module_map) 2025-05-07T20:32:37.1059687Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1059829Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1059904Z E ^ 2025-05-07T20:32:37.1060251Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1060257Z 2025-05-07T20:32:37.1060663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1060667Z 2025-05-07T20:32:37.1060768Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1060987Z self=, 2025-05-07T20:32:37.1061065Z T=16384, 2025-05-07T20:32:37.1061140Z D=7168, 2025-05-07T20:32:37.1061222Z scale_ub=None, 2025-05-07T20:32:37.1061306Z contiguous=True, 2025-05-07T20:32:37.1061388Z compiled=True, 2025-05-07T20:32:37.1061462Z ) 2025-05-07T20:32:37.1061680Z self = 2025-05-07T20:32:37.1061851Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.1061859Z 2025-05-07T20:32:37.1061933Z @given( 2025-05-07T20:32:37.1062056Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1062156Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1062270Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1062384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1062498Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1062572Z ) 2025-05-07T20:32:37.1062813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1062951Z def test_silu_mul_quant( 2025-05-07T20:32:37.1063025Z self, 2025-05-07T20:32:37.1063101Z T: int, 2025-05-07T20:32:37.1063179Z D: int, 2025-05-07T20:32:37.1063277Z scale_ub: Optional[float], 2025-05-07T20:32:37.1063371Z contiguous: bool, 2025-05-07T20:32:37.1063457Z compiled: bool, 2025-05-07T20:32:37.1063534Z ) -> None: 2025-05-07T20:32:37.1063628Z torch.manual_seed(2025) 2025-05-07T20:32:37.1063739Z 2025-05-07T20:32:37.1063902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1063976Z 2025-05-07T20:32:37.1064065Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1064186Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1064276Z x = x_sign * x_clamp 2025-05-07T20:32:37.1064354Z x0 = x[:, :D] 2025-05-07T20:32:37.1064432Z x1 = x[:, D:] 2025-05-07T20:32:37.1064510Z 2025-05-07T20:32:37.1064590Z if contiguous: 2025-05-07T20:32:37.1064679Z x0 = x0.contiguous() 2025-05-07T20:32:37.1064771Z x1 = x1.contiguous() 2025-05-07T20:32:37.1064841Z 2025-05-07T20:32:37.1064932Z if scale_ub is not None: 2025-05-07T20:32:37.1065078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1065213Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1065290Z ) 2025-05-07T20:32:37.1065365Z else: 2025-05-07T20:32:37.1065461Z scale_ub_tensor = None 2025-05-07T20:32:37.1065536Z 2025-05-07T20:32:37.1065662Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1065750Z op = silu_mul_quant 2025-05-07T20:32:37.1065835Z if compiled: 2025-05-07T20:32:37.1065932Z op = torch.compile(op) 2025-05-07T20:32:37.1066039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1066113Z 2025-05-07T20:32:37.1066205Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1066210Z 2025-05-07T20:32:37.1066308Z moe/activation_test.py:117: 2025-05-07T20:32:37.1066437Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1066534Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1066637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1067041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1067135Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1067626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1067723Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1068080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1068299Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1068635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1068728Z kernel = self.compile( 2025-05-07T20:32:37.1069107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1069285Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1069411Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1069421Z 2025-05-07T20:32:37.1069622Z self = 2025-05-07T20:32:37.1070383Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1070918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d60b740>} 2025-05-07T20:32:37.1071657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1071844Z context = 2025-05-07T20:32:37.1071848Z 2025-05-07T20:32:37.1072050Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1072308Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1072415Z module_map=module_map) 2025-05-07T20:32:37.1072577Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1072672Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1072750Z E ^ 2025-05-07T20:32:37.1073098Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1073102Z 2025-05-07T20:32:37.1073545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1073550Z 2025-05-07T20:32:37.1073657Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1073874Z self=, 2025-05-07T20:32:37.1073952Z T=4096, 2025-05-07T20:32:37.1074030Z D=5120, 2025-05-07T20:32:37.1074110Z scale_ub=None, 2025-05-07T20:32:37.1074197Z contiguous=False, 2025-05-07T20:32:37.1074281Z compiled=True, 2025-05-07T20:32:37.1074352Z ) 2025-05-07T20:32:37.1074567Z self = 2025-05-07T20:32:37.1074738Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.1074745Z 2025-05-07T20:32:37.1074819Z @given( 2025-05-07T20:32:37.1074940Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1075039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1075151Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1075272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1075424Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1075507Z ) 2025-05-07T20:32:37.1075790Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1075885Z def test_silu_mul_quant( 2025-05-07T20:32:37.1075961Z self, 2025-05-07T20:32:37.1076041Z T: int, 2025-05-07T20:32:37.1076115Z D: int, 2025-05-07T20:32:37.1076211Z scale_ub: Optional[float], 2025-05-07T20:32:37.1076303Z contiguous: bool, 2025-05-07T20:32:37.1076386Z compiled: bool, 2025-05-07T20:32:37.1076469Z ) -> None: 2025-05-07T20:32:37.1076566Z torch.manual_seed(2025) 2025-05-07T20:32:37.1076638Z 2025-05-07T20:32:37.1076806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1076876Z 2025-05-07T20:32:37.1076967Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1077094Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1077183Z x = x_sign * x_clamp 2025-05-07T20:32:37.1077261Z x0 = x[:, :D] 2025-05-07T20:32:37.1077348Z x1 = x[:, D:] 2025-05-07T20:32:37.1077423Z 2025-05-07T20:32:37.1077504Z if contiguous: 2025-05-07T20:32:37.1077598Z x0 = x0.contiguous() 2025-05-07T20:32:37.1077685Z x1 = x1.contiguous() 2025-05-07T20:32:37.1077755Z 2025-05-07T20:32:37.1077846Z if scale_ub is not None: 2025-05-07T20:32:37.1077950Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1078084Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1078231Z ) 2025-05-07T20:32:37.1078307Z else: 2025-05-07T20:32:37.1078403Z scale_ub_tensor = None 2025-05-07T20:32:37.1078473Z 2025-05-07T20:32:37.1078602Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1078692Z op = silu_mul_quant 2025-05-07T20:32:37.1078778Z if compiled: 2025-05-07T20:32:37.1078881Z op = torch.compile(op) 2025-05-07T20:32:37.1078988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1079059Z 2025-05-07T20:32:37.1079193Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1079202Z 2025-05-07T20:32:37.1079296Z moe/activation_test.py:117: 2025-05-07T20:32:37.1079424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1079524Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1079623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1079983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1080082Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1080568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1080705Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1081067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1081286Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1081628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1081720Z kernel = self.compile( 2025-05-07T20:32:37.1082096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1082272Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1082399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1082404Z 2025-05-07T20:32:37.1082607Z self = 2025-05-07T20:32:37.1083409Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1083905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf148c20>} 2025-05-07T20:32:37.1084643Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1084831Z context = 2025-05-07T20:32:37.1084839Z 2025-05-07T20:32:37.1085004Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1085259Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1085369Z module_map=module_map) 2025-05-07T20:32:37.1085560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1085681Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1085760Z E ^ 2025-05-07T20:32:37.1086108Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1086113Z 2025-05-07T20:32:37.1086517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1086521Z 2025-05-07T20:32:37.1086626Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1086843Z self=, 2025-05-07T20:32:37.1086961Z T=4096, 2025-05-07T20:32:37.1087037Z D=5120, 2025-05-07T20:32:37.1087120Z scale_ub=1200.0, 2025-05-07T20:32:37.1087207Z contiguous=False, 2025-05-07T20:32:37.1087288Z compiled=False, 2025-05-07T20:32:37.1087362Z ) 2025-05-07T20:32:37.1087582Z self = 2025-05-07T20:32:37.1087754Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.1087799Z 2025-05-07T20:32:37.1087875Z @given( 2025-05-07T20:32:37.1087995Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1088094Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1088211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1088326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1088438Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1088514Z ) 2025-05-07T20:32:37.1088755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1088846Z def test_silu_mul_quant( 2025-05-07T20:32:37.1088924Z self, 2025-05-07T20:32:37.1089000Z T: int, 2025-05-07T20:32:37.1089117Z D: int, 2025-05-07T20:32:37.1089224Z scale_ub: Optional[float], 2025-05-07T20:32:37.1089315Z contiguous: bool, 2025-05-07T20:32:37.1089399Z compiled: bool, 2025-05-07T20:32:37.1089480Z ) -> None: 2025-05-07T20:32:37.1089576Z torch.manual_seed(2025) 2025-05-07T20:32:37.1089652Z 2025-05-07T20:32:37.1089817Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1089887Z 2025-05-07T20:32:37.1089982Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1090105Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1090192Z x = x_sign * x_clamp 2025-05-07T20:32:37.1090272Z x0 = x[:, :D] 2025-05-07T20:32:37.1090353Z x1 = x[:, D:] 2025-05-07T20:32:37.1090425Z 2025-05-07T20:32:37.1090510Z if contiguous: 2025-05-07T20:32:37.1090601Z x0 = x0.contiguous() 2025-05-07T20:32:37.1090688Z x1 = x1.contiguous() 2025-05-07T20:32:37.1090761Z 2025-05-07T20:32:37.1090853Z if scale_ub is not None: 2025-05-07T20:32:37.1090999Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1091137Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1091216Z ) 2025-05-07T20:32:37.1091296Z else: 2025-05-07T20:32:37.1091388Z scale_ub_tensor = None 2025-05-07T20:32:37.1091457Z 2025-05-07T20:32:37.1091587Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1091675Z op = silu_mul_quant 2025-05-07T20:32:37.1091758Z if compiled: 2025-05-07T20:32:37.1091861Z op = torch.compile(op) 2025-05-07T20:32:37.1091964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1092039Z 2025-05-07T20:32:37.1092130Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1092135Z 2025-05-07T20:32:37.1092228Z moe/activation_test.py:117: 2025-05-07T20:32:37.1092362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1092464Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1092564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1093057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:37.1093154Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1093509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1093784Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1094121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1094784Z kernel = self.compile( 2025-05-07T20:32:37.1095165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1095340Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1095474Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1095478Z 2025-05-07T20:32:37.1095681Z self = 2025-05-07T20:32:37.1096491Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1096987Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf1496c0>} 2025-05-07T20:32:37.1097720Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1097955Z context = 2025-05-07T20:32:37.1097963Z 2025-05-07T20:32:37.1101453Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1101744Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1101862Z module_map=module_map) 2025-05-07T20:32:37.1102030Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1102131Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1102209Z E ^ 2025-05-07T20:32:37.1102563Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1102572Z 2025-05-07T20:32:37.1102982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1102987Z 2025-05-07T20:32:37.1103091Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1103421Z self=, 2025-05-07T20:32:37.1103503Z T=4096, 2025-05-07T20:32:37.1103584Z D=5120, 2025-05-07T20:32:37.1103666Z scale_ub=1200.0, 2025-05-07T20:32:37.1103754Z contiguous=False, 2025-05-07T20:32:37.1103839Z compiled=True, 2025-05-07T20:32:37.1103910Z ) 2025-05-07T20:32:37.1104126Z self = 2025-05-07T20:32:37.1104299Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.1104304Z 2025-05-07T20:32:37.1104380Z @given( 2025-05-07T20:32:37.1104502Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1104604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1104718Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1104837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1104953Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1105026Z ) 2025-05-07T20:32:37.1105274Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1105364Z def test_silu_mul_quant( 2025-05-07T20:32:37.1105441Z self, 2025-05-07T20:32:37.1105521Z T: int, 2025-05-07T20:32:37.1105597Z D: int, 2025-05-07T20:32:37.1105692Z scale_ub: Optional[float], 2025-05-07T20:32:37.1105782Z contiguous: bool, 2025-05-07T20:32:37.1105867Z compiled: bool, 2025-05-07T20:32:37.1105948Z ) -> None: 2025-05-07T20:32:37.1106041Z torch.manual_seed(2025) 2025-05-07T20:32:37.1106112Z 2025-05-07T20:32:37.1106279Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1106417Z 2025-05-07T20:32:37.1106507Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1106634Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1106722Z x = x_sign * x_clamp 2025-05-07T20:32:37.1106804Z x0 = x[:, :D] 2025-05-07T20:32:37.1106892Z x1 = x[:, D:] 2025-05-07T20:32:37.1106963Z 2025-05-07T20:32:37.1107047Z if contiguous: 2025-05-07T20:32:37.1107142Z x0 = x0.contiguous() 2025-05-07T20:32:37.1107295Z x1 = x1.contiguous() 2025-05-07T20:32:37.1107368Z 2025-05-07T20:32:37.1107457Z if scale_ub is not None: 2025-05-07T20:32:37.1107560Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1107696Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1107773Z ) 2025-05-07T20:32:37.1107847Z else: 2025-05-07T20:32:37.1107943Z scale_ub_tensor = None 2025-05-07T20:32:37.1108016Z 2025-05-07T20:32:37.1108146Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1108237Z op = silu_mul_quant 2025-05-07T20:32:37.1108320Z if compiled: 2025-05-07T20:32:37.1108418Z op = torch.compile(op) 2025-05-07T20:32:37.1108593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1108667Z 2025-05-07T20:32:37.1108757Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1108765Z 2025-05-07T20:32:37.1108864Z moe/activation_test.py:117: 2025-05-07T20:32:37.1108994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1109096Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1109195Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1109556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1109652Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1110136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1110238Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1110593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1110881Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1111225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1111319Z kernel = self.compile( 2025-05-07T20:32:37.1111695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1111867Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1111993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1112003Z 2025-05-07T20:32:37.1112206Z self = 2025-05-07T20:32:37.1112978Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1113482Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf14afc0>} 2025-05-07T20:32:37.1114217Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1114407Z context = 2025-05-07T20:32:37.1114412Z 2025-05-07T20:32:37.1114579Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1114880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1114986Z module_map=module_map) 2025-05-07T20:32:37.1115159Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1115257Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1115338Z E ^ 2025-05-07T20:32:37.1115685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1115728Z 2025-05-07T20:32:37.1116137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1116141Z 2025-05-07T20:32:37.1116244Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1116462Z self=, 2025-05-07T20:32:37.1116540Z T=2048, 2025-05-07T20:32:37.1116619Z D=7168, 2025-05-07T20:32:37.1116703Z scale_ub=1200.0, 2025-05-07T20:32:37.1116792Z contiguous=False, 2025-05-07T20:32:37.1116874Z compiled=False, 2025-05-07T20:32:37.1116947Z ) 2025-05-07T20:32:37.1117162Z self = 2025-05-07T20:32:37.1117374Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.1117378Z 2025-05-07T20:32:37.1117459Z @given( 2025-05-07T20:32:37.1117577Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1117678Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1117796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1117910Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1118025Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1118098Z ) 2025-05-07T20:32:37.1118339Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1118439Z def test_silu_mul_quant( 2025-05-07T20:32:37.1118514Z self, 2025-05-07T20:32:37.1118590Z T: int, 2025-05-07T20:32:37.1118667Z D: int, 2025-05-07T20:32:37.1118765Z scale_ub: Optional[float], 2025-05-07T20:32:37.1118852Z contiguous: bool, 2025-05-07T20:32:37.1118941Z compiled: bool, 2025-05-07T20:32:37.1119061Z ) -> None: 2025-05-07T20:32:37.1119155Z torch.manual_seed(2025) 2025-05-07T20:32:37.1119231Z 2025-05-07T20:32:37.1119396Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1119468Z 2025-05-07T20:32:37.1119561Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1119684Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1119773Z x = x_sign * x_clamp 2025-05-07T20:32:37.1119851Z x0 = x[:, :D] 2025-05-07T20:32:37.1119931Z x1 = x[:, D:] 2025-05-07T20:32:37.1120003Z 2025-05-07T20:32:37.1120084Z if contiguous: 2025-05-07T20:32:37.1120178Z x0 = x0.contiguous() 2025-05-07T20:32:37.1120267Z x1 = x1.contiguous() 2025-05-07T20:32:37.1120338Z 2025-05-07T20:32:37.1120427Z if scale_ub is not None: 2025-05-07T20:32:37.1120531Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1120666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1120743Z ) 2025-05-07T20:32:37.1120821Z else: 2025-05-07T20:32:37.1120913Z scale_ub_tensor = None 2025-05-07T20:32:37.1120990Z 2025-05-07T20:32:37.1121116Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1121204Z op = silu_mul_quant 2025-05-07T20:32:37.1121289Z if compiled: 2025-05-07T20:32:37.1121388Z op = torch.compile(op) 2025-05-07T20:32:37.1121493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1121568Z 2025-05-07T20:32:37.1121657Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1121706Z 2025-05-07T20:32:37.1121803Z moe/activation_test.py:117: 2025-05-07T20:32:37.1121935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1122034Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1122138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1122629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:37.1122726Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1123121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1123340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1123673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1123770Z kernel = self.compile( 2025-05-07T20:32:37.1124150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1124322Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1124487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1124492Z 2025-05-07T20:32:37.1124694Z self = 2025-05-07T20:32:37.1125457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1125955Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf14bec0>} 2025-05-07T20:32:37.1126688Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1126879Z context = 2025-05-07T20:32:37.1126883Z 2025-05-07T20:32:37.1127049Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1127343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1127451Z module_map=module_map) 2025-05-07T20:32:37.1127616Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1127711Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1127791Z E ^ 2025-05-07T20:32:37.1128142Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1128146Z 2025-05-07T20:32:37.1128547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1128555Z 2025-05-07T20:32:37.1128661Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1128880Z self=, 2025-05-07T20:32:37.1128962Z T=1, 2025-05-07T20:32:37.1129043Z D=7168, 2025-05-07T20:32:37.1129127Z scale_ub=None, 2025-05-07T20:32:37.1129213Z contiguous=True, 2025-05-07T20:32:37.1129302Z compiled=False, 2025-05-07T20:32:37.1129379Z ) 2025-05-07T20:32:37.1129592Z self = 2025-05-07T20:32:37.1129753Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.1129757Z 2025-05-07T20:32:37.1129831Z @given( 2025-05-07T20:32:37.1129952Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1130048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1130159Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1130325Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1130436Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1130508Z ) 2025-05-07T20:32:37.1130757Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1130849Z def test_silu_mul_quant( 2025-05-07T20:32:37.1130924Z self, 2025-05-07T20:32:37.1131004Z T: int, 2025-05-07T20:32:37.1131080Z D: int, 2025-05-07T20:32:37.1131222Z scale_ub: Optional[float], 2025-05-07T20:32:37.1131311Z contiguous: bool, 2025-05-07T20:32:37.1131395Z compiled: bool, 2025-05-07T20:32:37.1131476Z ) -> None: 2025-05-07T20:32:37.1131568Z torch.manual_seed(2025) 2025-05-07T20:32:37.1131639Z 2025-05-07T20:32:37.1131805Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1131876Z 2025-05-07T20:32:37.1131968Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1132099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1132184Z x = x_sign * x_clamp 2025-05-07T20:32:37.1132264Z x0 = x[:, :D] 2025-05-07T20:32:37.1132344Z x1 = x[:, D:] 2025-05-07T20:32:37.1132414Z 2025-05-07T20:32:37.1132536Z if contiguous: 2025-05-07T20:32:37.1132632Z x0 = x0.contiguous() 2025-05-07T20:32:37.1132721Z x1 = x1.contiguous() 2025-05-07T20:32:37.1132798Z 2025-05-07T20:32:37.1132886Z if scale_ub is not None: 2025-05-07T20:32:37.1132993Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1133127Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1133202Z ) 2025-05-07T20:32:37.1133277Z else: 2025-05-07T20:32:37.1133373Z scale_ub_tensor = None 2025-05-07T20:32:37.1133445Z 2025-05-07T20:32:37.1133572Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1133752Z op = silu_mul_quant 2025-05-07T20:32:37.1133836Z if compiled: 2025-05-07T20:32:37.1133932Z op = torch.compile(op) 2025-05-07T20:32:37.1134040Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1134113Z 2025-05-07T20:32:37.1134211Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1134216Z 2025-05-07T20:32:37.1134355Z moe/activation_test.py:117: 2025-05-07T20:32:37.1134485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1134590Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1134688Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1135177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1135278Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1135653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1135905Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1136240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1136330Z kernel = self.compile( 2025-05-07T20:32:37.1136716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1136887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1137018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1137028Z 2025-05-07T20:32:37.1137230Z self = 2025-05-07T20:32:37.1137989Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1138527Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d5bccc0>} 2025-05-07T20:32:37.1139262Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1139456Z context = 2025-05-07T20:32:37.1139499Z 2025-05-07T20:32:37.1139664Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1139921Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1140035Z module_map=module_map) 2025-05-07T20:32:37.1140194Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1140298Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1140375Z E ^ 2025-05-07T20:32:37.1140720Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1140724Z 2025-05-07T20:32:37.1141198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1141203Z 2025-05-07T20:32:37.1141305Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1141526Z self=, 2025-05-07T20:32:37.1141606Z T=16384, 2025-05-07T20:32:37.1141682Z D=7168, 2025-05-07T20:32:37.1141772Z scale_ub=1200.0, 2025-05-07T20:32:37.1141857Z contiguous=False, 2025-05-07T20:32:37.1141942Z compiled=True, 2025-05-07T20:32:37.1142018Z ) 2025-05-07T20:32:37.1142233Z self = 2025-05-07T20:32:37.1142408Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.1142413Z 2025-05-07T20:32:37.1142490Z @given( 2025-05-07T20:32:37.1142609Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1142709Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1142827Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1142985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1143102Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1143178Z ) 2025-05-07T20:32:37.1143418Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1143512Z def test_silu_mul_quant( 2025-05-07T20:32:37.1143586Z self, 2025-05-07T20:32:37.1143661Z T: int, 2025-05-07T20:32:37.1143739Z D: int, 2025-05-07T20:32:37.1143837Z scale_ub: Optional[float], 2025-05-07T20:32:37.1143924Z contiguous: bool, 2025-05-07T20:32:37.1144014Z compiled: bool, 2025-05-07T20:32:37.1144090Z ) -> None: 2025-05-07T20:32:37.1144182Z torch.manual_seed(2025) 2025-05-07T20:32:37.1144260Z 2025-05-07T20:32:37.1144424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1144498Z 2025-05-07T20:32:37.1144590Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1144717Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1144810Z x = x_sign * x_clamp 2025-05-07T20:32:37.1144889Z x0 = x[:, :D] 2025-05-07T20:32:37.1144968Z x1 = x[:, D:] 2025-05-07T20:32:37.1145040Z 2025-05-07T20:32:37.1145121Z if contiguous: 2025-05-07T20:32:37.1145210Z x0 = x0.contiguous() 2025-05-07T20:32:37.1145302Z x1 = x1.contiguous() 2025-05-07T20:32:37.1145372Z 2025-05-07T20:32:37.1145478Z if scale_ub is not None: 2025-05-07T20:32:37.1145594Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1145793Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1145870Z ) 2025-05-07T20:32:37.1145945Z else: 2025-05-07T20:32:37.1146037Z scale_ub_tensor = None 2025-05-07T20:32:37.1146111Z 2025-05-07T20:32:37.1146240Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1146327Z op = silu_mul_quant 2025-05-07T20:32:37.1146415Z if compiled: 2025-05-07T20:32:37.1146513Z op = torch.compile(op) 2025-05-07T20:32:37.1146617Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1146735Z 2025-05-07T20:32:37.1146823Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1146828Z 2025-05-07T20:32:37.1146921Z moe/activation_test.py:117: 2025-05-07T20:32:37.1147053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1147152Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1147253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1147614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1147706Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1148234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1148334Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1148685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1148911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1149243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1149339Z kernel = self.compile( 2025-05-07T20:32:37.1149715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1149887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1150015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1150019Z 2025-05-07T20:32:37.1150222Z self = 2025-05-07T20:32:37.1151028Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1151525Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d5be0c0>} 2025-05-07T20:32:37.1152256Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1152447Z context = 2025-05-07T20:32:37.1152452Z 2025-05-07T20:32:37.1152613Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1152873Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1152983Z module_map=module_map) 2025-05-07T20:32:37.1153143Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1153244Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1153320Z E ^ 2025-05-07T20:32:37.1153668Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d5bec00>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
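Every example fails the same way, and it fails before the kernel ever runs: Triton rejects the fp8e4nv dtype (FP8 E4M3) while lowering _fbgemm_silu_mul_quant. That dtype is only supported on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper class), which is consistent with this runner carrying a pre-8.9 part such as the A10G found in g5 instances (capability 8.6); on older architectures Triton offers only 'fp8e4b15' and 'fp8e5', exactly as the error says. Rather than letting Hypothesis burn through _MAX_SAMPLES identical failures, the test class could be gated on device capability. A minimal sketch, assuming a unittest-style suite like the one in the trace (the helper _supports_fp8e4nv and the class name are hypothetical, not FBGEMM API):

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (FP8 E4M3) only on NVIDIA parts with
        # compute capability >= 8.9; the A10G in g5 instances reports
        # (8, 6), which is why the compile step above rejects the dtype.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
    class SiluMulQuantTests(unittest.TestCase):
        ...

With such a gate, the whole class would report as skipped on this machine instead of producing one CompilationError per drawn example.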
Hypothesis then tried eleven more examples. Every one failed at the same kernel-compile step (triton/compiler/compiler.py:100) with the identical CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").

Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1311875Z 2025-05-07T20:32:37.1312283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1312287Z 2025-05-07T20:32:37.1312388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1312604Z self=, 2025-05-07T20:32:37.1312684Z T=1, 2025-05-07T20:32:37.1312758Z D=7168, 2025-05-07T20:32:37.1312843Z scale_ub=1200.0, 2025-05-07T20:32:37.1312927Z contiguous=False, 2025-05-07T20:32:37.1313011Z compiled=False, 2025-05-07T20:32:37.1313085Z ) 2025-05-07T20:32:37.1313301Z self = 2025-05-07T20:32:37.1313505Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.1313510Z 2025-05-07T20:32:37.1313587Z @given( 2025-05-07T20:32:37.1313705Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1313806Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1313927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1314041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1314155Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1314227Z ) 2025-05-07T20:32:37.1314466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1314564Z def test_silu_mul_quant( 2025-05-07T20:32:37.1314640Z self, 2025-05-07T20:32:37.1314714Z T: int, 2025-05-07T20:32:37.1314792Z D: int, 2025-05-07T20:32:37.1314888Z scale_ub: Optional[float], 2025-05-07T20:32:37.1314978Z contiguous: bool, 2025-05-07T20:32:37.1315067Z compiled: bool, 2025-05-07T20:32:37.1315144Z ) -> None: 2025-05-07T20:32:37.1315237Z torch.manual_seed(2025) 2025-05-07T20:32:37.1315310Z 2025-05-07T20:32:37.1315476Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1315549Z 2025-05-07T20:32:37.1315638Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1315762Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1315851Z x = x_sign * x_clamp 2025-05-07T20:32:37.1315929Z x0 = x[:, :D] 2025-05-07T20:32:37.1316006Z x1 = x[:, D:] 2025-05-07T20:32:37.1316079Z 2025-05-07T20:32:37.1316207Z if contiguous: 2025-05-07T20:32:37.1316297Z x0 = x0.contiguous() 2025-05-07T20:32:37.1316388Z x1 = x1.contiguous() 2025-05-07T20:32:37.1316458Z 2025-05-07T20:32:37.1316547Z if scale_ub is not None: 2025-05-07T20:32:37.1316656Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1316792Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1316868Z ) 2025-05-07T20:32:37.1316942Z else: 2025-05-07T20:32:37.1317034Z scale_ub_tensor = None 2025-05-07T20:32:37.1317149Z 2025-05-07T20:32:37.1317277Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1317365Z op = silu_mul_quant 2025-05-07T20:32:37.1317453Z if compiled: 2025-05-07T20:32:37.1317550Z op = torch.compile(op) 2025-05-07T20:32:37.1317654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1317729Z 2025-05-07T20:32:37.1317820Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1317827Z 2025-05-07T20:32:37.1317921Z moe/activation_test.py:117: 2025-05-07T20:32:37.1318051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1318149Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1318290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1318780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1318878Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1319237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1319457Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1319790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1319884Z kernel = self.compile( 2025-05-07T20:32:37.1320263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1320435Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1320563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1320568Z 2025-05-07T20:32:37.1320808Z self = 2025-05-07T20:32:37.1321570Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1322066Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d20c0e0>} 2025-05-07T20:32:37.1322802Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1322994Z context = 2025-05-07T20:32:37.1323001Z 2025-05-07T20:32:37.1323170Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1323423Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1323531Z module_map=module_map) 2025-05-07T20:32:37.1323693Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1323790Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1323864Z E ^ 2025-05-07T20:32:37.1324213Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1324217Z 2025-05-07T20:32:37.1324688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1324692Z 2025-05-07T20:32:37.1324798Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1325017Z self=, 2025-05-07T20:32:37.1325094Z T=4096, 2025-05-07T20:32:37.1325176Z D=7168, 2025-05-07T20:32:37.1325259Z scale_ub=1200.0, 2025-05-07T20:32:37.1325356Z contiguous=False, 2025-05-07T20:32:37.1325456Z compiled=True, 2025-05-07T20:32:37.1325592Z ) 2025-05-07T20:32:37.1325812Z self = 2025-05-07T20:32:37.1325987Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.1325991Z 2025-05-07T20:32:37.1326067Z @given( 2025-05-07T20:32:37.1326187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1326286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1326405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1326523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1326636Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1326710Z ) 2025-05-07T20:32:37.1326992Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1327089Z def test_silu_mul_quant( 2025-05-07T20:32:37.1327169Z self, 2025-05-07T20:32:37.1327246Z T: int, 2025-05-07T20:32:37.1327326Z D: int, 2025-05-07T20:32:37.1327425Z scale_ub: Optional[float], 2025-05-07T20:32:37.1327514Z contiguous: bool, 2025-05-07T20:32:37.1327598Z compiled: bool, 2025-05-07T20:32:37.1327678Z ) -> None: 2025-05-07T20:32:37.1327771Z torch.manual_seed(2025) 2025-05-07T20:32:37.1327844Z 2025-05-07T20:32:37.1328012Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1328084Z 2025-05-07T20:32:37.1328176Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1328302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1328389Z x = x_sign * x_clamp 2025-05-07T20:32:37.1328468Z x0 = x[:, :D] 2025-05-07T20:32:37.1328547Z x1 = x[:, D:] 2025-05-07T20:32:37.1328621Z 2025-05-07T20:32:37.1328708Z if contiguous: 2025-05-07T20:32:37.1328841Z x0 = x0.contiguous() 2025-05-07T20:32:37.1328930Z x1 = x1.contiguous() 2025-05-07T20:32:37.1329003Z 2025-05-07T20:32:37.1329096Z if scale_ub is not None: 2025-05-07T20:32:37.1329200Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1329335Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1329409Z ) 2025-05-07T20:32:37.1329482Z else: 2025-05-07T20:32:37.1329576Z scale_ub_tensor = None 2025-05-07T20:32:37.1329646Z 2025-05-07T20:32:37.1329772Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1329866Z op = silu_mul_quant 2025-05-07T20:32:37.1329947Z if compiled: 2025-05-07T20:32:37.1330050Z op = torch.compile(op) 2025-05-07T20:32:37.1330154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1330225Z 2025-05-07T20:32:37.1330320Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1330327Z 2025-05-07T20:32:37.1330421Z moe/activation_test.py:117: 2025-05-07T20:32:37.1330549Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1330653Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1330750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1331108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1331203Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1331687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1331829Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1332181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1332402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1332744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1332837Z kernel = self.compile( 2025-05-07T20:32:37.1333258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1333429Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1333553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1333558Z 2025-05-07T20:32:37.1333841Z self = 2025-05-07T20:32:37.1334604Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1335144Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d20d300>} 2025-05-07T20:32:37.1335929Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1336117Z context = 2025-05-07T20:32:37.1336122Z 2025-05-07T20:32:37.1336285Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1336539Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1336652Z module_map=module_map) 2025-05-07T20:32:37.1336810Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1336907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1336987Z E ^ 2025-05-07T20:32:37.1337372Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1337377Z 2025-05-07T20:32:37.1337780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1337790Z 2025-05-07T20:32:37.1337891Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1338108Z self=, 2025-05-07T20:32:37.1338185Z T=128, 2025-05-07T20:32:37.1338261Z D=7168, 2025-05-07T20:32:37.1338343Z scale_ub=1200.0, 2025-05-07T20:32:37.1338432Z contiguous=False, 2025-05-07T20:32:37.1338513Z compiled=True, 2025-05-07T20:32:37.1338585Z ) 2025-05-07T20:32:37.1338801Z self = 2025-05-07T20:32:37.1338971Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.1338975Z 2025-05-07T20:32:37.1339051Z @given( 2025-05-07T20:32:37.1339170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1339267Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1339385Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1339500Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1339611Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1339688Z ) 2025-05-07T20:32:37.1339931Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1342856Z def test_silu_mul_quant( 2025-05-07T20:32:37.1342944Z self, 2025-05-07T20:32:37.1343097Z T: int, 2025-05-07T20:32:37.1343178Z D: int, 2025-05-07T20:32:37.1343278Z scale_ub: Optional[float], 2025-05-07T20:32:37.1343367Z contiguous: bool, 2025-05-07T20:32:37.1343458Z compiled: bool, 2025-05-07T20:32:37.1343538Z ) -> None: 2025-05-07T20:32:37.1343635Z torch.manual_seed(2025) 2025-05-07T20:32:37.1343713Z 2025-05-07T20:32:37.1343884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1343958Z 2025-05-07T20:32:37.1344092Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1344215Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1344304Z x = x_sign * x_clamp 2025-05-07T20:32:37.1344382Z x0 = x[:, :D] 2025-05-07T20:32:37.1344460Z x1 = x[:, D:] 2025-05-07T20:32:37.1344535Z 2025-05-07T20:32:37.1344619Z if contiguous: 2025-05-07T20:32:37.1344708Z x0 = x0.contiguous() 2025-05-07T20:32:37.1344802Z x1 = x1.contiguous() 2025-05-07T20:32:37.1344874Z 2025-05-07T20:32:37.1344962Z if scale_ub is not None: 2025-05-07T20:32:37.1345071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1345203Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1345323Z ) 2025-05-07T20:32:37.1345399Z else: 2025-05-07T20:32:37.1345493Z scale_ub_tensor = None 2025-05-07T20:32:37.1345568Z 2025-05-07T20:32:37.1345695Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1345787Z op = silu_mul_quant 2025-05-07T20:32:37.1345873Z if compiled: 2025-05-07T20:32:37.1345971Z op = torch.compile(op) 2025-05-07T20:32:37.1346074Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1346149Z 2025-05-07T20:32:37.1346237Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1346242Z 2025-05-07T20:32:37.1346338Z moe/activation_test.py:117: 2025-05-07T20:32:37.1346474Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1346573Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1346675Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1347045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1347177Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1347667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1347764Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1348118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1348340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1348674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1348770Z kernel = self.compile( 2025-05-07T20:32:37.1349149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1349323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1349455Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1349459Z 2025-05-07T20:32:37.1349661Z self = 2025-05-07T20:32:37.1350428Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1350922Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d20e160>} 2025-05-07T20:32:37.1351698Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1351889Z context = 2025-05-07T20:32:37.1351893Z 2025-05-07T20:32:37.1352057Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1352318Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1352466Z module_map=module_map) 2025-05-07T20:32:37.1352627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1352725Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1352801Z E ^ 2025-05-07T20:32:37.1353155Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1353162Z 2025-05-07T20:32:37.1353566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1353570Z 2025-05-07T20:32:37.1353671Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1353930Z self=, 2025-05-07T20:32:37.1354009Z T=2048, 2025-05-07T20:32:37.1354084Z D=7168, 2025-05-07T20:32:37.1354170Z scale_ub=None, 2025-05-07T20:32:37.1354256Z contiguous=True, 2025-05-07T20:32:37.1354341Z compiled=True, 2025-05-07T20:32:37.1354414Z ) 2025-05-07T20:32:37.1354629Z self = 2025-05-07T20:32:37.1354797Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.1354801Z 2025-05-07T20:32:37.1354874Z @given( 2025-05-07T20:32:37.1354991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1355093Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1355205Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1355320Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1355434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1355508Z ) 2025-05-07T20:32:37.1355819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1355911Z def test_silu_mul_quant( 2025-05-07T20:32:37.1355987Z self, 2025-05-07T20:32:37.1356072Z T: int, 2025-05-07T20:32:37.1356149Z D: int, 2025-05-07T20:32:37.1356246Z scale_ub: Optional[float], 2025-05-07T20:32:37.1356338Z contiguous: bool, 2025-05-07T20:32:37.1356421Z compiled: bool, 2025-05-07T20:32:37.1356502Z ) -> None: 2025-05-07T20:32:37.1356597Z torch.manual_seed(2025) 2025-05-07T20:32:37.1356667Z 2025-05-07T20:32:37.1356834Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1356910Z 2025-05-07T20:32:37.1357000Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1357126Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1357212Z x = x_sign * x_clamp 2025-05-07T20:32:37.1357292Z x0 = x[:, :D] 2025-05-07T20:32:37.1357376Z x1 = x[:, D:] 2025-05-07T20:32:37.1357448Z 2025-05-07T20:32:37.1357535Z if contiguous: 2025-05-07T20:32:37.1357624Z x0 = x0.contiguous() 2025-05-07T20:32:37.1357712Z x1 = x1.contiguous() 2025-05-07T20:32:37.1357793Z 2025-05-07T20:32:37.1357881Z if scale_ub is not None: 2025-05-07T20:32:37.1357986Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1358121Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1358195Z ) 2025-05-07T20:32:37.1358271Z else: 2025-05-07T20:32:37.1358367Z scale_ub_tensor = None 2025-05-07T20:32:37.1358438Z 2025-05-07T20:32:37.1358612Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1358703Z op = silu_mul_quant 2025-05-07T20:32:37.1358787Z if compiled: 2025-05-07T20:32:37.1358886Z op = torch.compile(op) 2025-05-07T20:32:37.1358991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1359062Z 2025-05-07T20:32:37.1359158Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1359162Z 2025-05-07T20:32:37.1359258Z moe/activation_test.py:117: 2025-05-07T20:32:37.1359430Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1359531Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1359628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1359989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1360083Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1360569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1360669Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1361022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1361282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1361621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1361716Z kernel = self.compile( 2025-05-07T20:32:37.1362094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1362264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1362391Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1362396Z 2025-05-07T20:32:37.1362603Z self = 2025-05-07T20:32:37.1363366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1363903Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d20f420>} 2025-05-07T20:32:37.1364635Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1364822Z context = 2025-05-07T20:32:37.1364827Z 2025-05-07T20:32:37.1364991Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1365248Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1365358Z module_map=module_map) 2025-05-07T20:32:37.1365516Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1365616Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1365697Z E ^ 2025-05-07T20:32:37.1366042Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1366049Z 2025-05-07T20:32:37.1366455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1366459Z 2025-05-07T20:32:37.1366560Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1366778Z self=, 2025-05-07T20:32:37.1366859Z T=16384, 2025-05-07T20:32:37.1366933Z D=5120, 2025-05-07T20:32:37.1367054Z scale_ub=None, 2025-05-07T20:32:37.1367141Z contiguous=False, 2025-05-07T20:32:37.1367225Z compiled=False, 2025-05-07T20:32:37.1367297Z ) 2025-05-07T20:32:37.1367513Z self = 2025-05-07T20:32:37.1367688Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.1367694Z 2025-05-07T20:32:37.1367772Z @given( 2025-05-07T20:32:37.1367888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1368024Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1368142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1368256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1368367Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1368441Z ) 2025-05-07T20:32:37.1368680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1368775Z def test_silu_mul_quant( 2025-05-07T20:32:37.1368853Z self, 2025-05-07T20:32:37.1368927Z T: int, 2025-05-07T20:32:37.1369004Z D: int, 2025-05-07T20:32:37.1369100Z scale_ub: Optional[float], 2025-05-07T20:32:37.1369187Z contiguous: bool, 2025-05-07T20:32:37.1369318Z compiled: bool, 2025-05-07T20:32:37.1369396Z ) -> None: 2025-05-07T20:32:37.1369490Z torch.manual_seed(2025) 2025-05-07T20:32:37.1369565Z 2025-05-07T20:32:37.1369728Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1369802Z 2025-05-07T20:32:37.1369896Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1370019Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1371784Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
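Every CompilationError repeated above shares one root cause: the Triton kernel behind silu_mul_quant quantizes to fp8e4nv (float8_e4m3fn), and Triton only lowers that dtype on GPUs with compute capability 8.9 or newer (Ada/Hopper class parts); on older architectures it raises exactly this error, offering only fp8e4b15 and fp8e5. The OutOfMemoryError immediately above is a separate issue, discussed below. A minimal sketch of a hardware guard, assuming it is acceptable to skip these tests on unsupported GPUs (the helper name is hypothetical, not existing test infrastructure):

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+ (Ada/Hopper);
        # older parts raise the "not supported in this architecture" error above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    # def test_silu_mul_quant(self, ...) -> None: ...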
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1371793Z 2025-05-07T20:32:37.1371948Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.1371953Z 2025-05-07T20:32:37.1372057Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1372279Z self=, 2025-05-07T20:32:37.1372353Z T=4096, 2025-05-07T20:32:37.1372431Z D=7168, 2025-05-07T20:32:37.1372512Z scale_ub=1200.0, 2025-05-07T20:32:37.1372594Z contiguous=True, 2025-05-07T20:32:37.1372677Z compiled=True, 2025-05-07T20:32:37.1372748Z ) 2025-05-07T20:32:37.1372960Z self = 2025-05-07T20:32:37.1373132Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.1373136Z 2025-05-07T20:32:37.1373212Z @given( 2025-05-07T20:32:37.1373329Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1373432Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1373546Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1373748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1373860Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1373937Z ) 2025-05-07T20:32:37.1374180Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1374271Z def test_silu_mul_quant( 2025-05-07T20:32:37.1374346Z self, 2025-05-07T20:32:37.1374424Z T: int, 2025-05-07T20:32:37.1374500Z D: int, 2025-05-07T20:32:37.1374596Z scale_ub: Optional[float], 2025-05-07T20:32:37.1374686Z contiguous: bool, 2025-05-07T20:32:37.1374819Z compiled: bool, 2025-05-07T20:32:37.1374899Z ) -> None: 2025-05-07T20:32:37.1374992Z torch.manual_seed(2025) 2025-05-07T20:32:37.1375062Z 2025-05-07T20:32:37.1375230Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1375305Z 2025-05-07T20:32:37.1375399Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1375538Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1377307Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1377355Z 2025-05-07T20:32:37.1377474Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.1377478Z 2025-05-07T20:32:37.1377578Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1377833Z self=, 2025-05-07T20:32:37.1377919Z T=16384, 2025-05-07T20:32:37.1377994Z D=7168, 2025-05-07T20:32:37.1378076Z scale_ub=None, 2025-05-07T20:32:37.1378160Z contiguous=False, 2025-05-07T20:32:37.1378244Z compiled=False, 2025-05-07T20:32:37.1378320Z ) 2025-05-07T20:32:37.1378534Z self = 2025-05-07T20:32:37.1378704Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.1378709Z 2025-05-07T20:32:37.1378786Z @given( 2025-05-07T20:32:37.1378901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1378998Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1379117Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1379230Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1379344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1379416Z ) 2025-05-07T20:32:37.1379700Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1379797Z def test_silu_mul_quant( 2025-05-07T20:32:37.1379872Z self, 2025-05-07T20:32:37.1379947Z T: int, 2025-05-07T20:32:37.1380030Z D: int, 2025-05-07T20:32:37.1380128Z scale_ub: Optional[float], 2025-05-07T20:32:37.1380214Z contiguous: bool, 2025-05-07T20:32:37.1380301Z compiled: bool, 2025-05-07T20:32:37.1380376Z ) -> None: 2025-05-07T20:32:37.1380468Z torch.manual_seed(2025) 2025-05-07T20:32:37.1380543Z 2025-05-07T20:32:37.1380707Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1382452Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1382460Z 2025-05-07T20:32:37.1382575Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.1382579Z 2025-05-07T20:32:37.1382681Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1382899Z self=, 2025-05-07T20:32:37.1382974Z T=2048, 2025-05-07T20:32:37.1383054Z D=7168, 2025-05-07T20:32:37.1383178Z scale_ub=1200.0, 2025-05-07T20:32:37.1383260Z contiguous=True, 2025-05-07T20:32:37.1383342Z compiled=True, 2025-05-07T20:32:37.1383413Z ) 2025-05-07T20:32:37.1383624Z self = 2025-05-07T20:32:37.1383795Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.1383799Z 2025-05-07T20:32:37.1383874Z @given( 2025-05-07T20:32:37.1383992Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1384087Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1384240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1384356Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1384466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1384539Z ) 2025-05-07T20:32:37.1384780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1384871Z def test_silu_mul_quant( 2025-05-07T20:32:37.1384947Z self, 2025-05-07T20:32:37.1385026Z T: int, 2025-05-07T20:32:37.1385101Z D: int, 2025-05-07T20:32:37.1385198Z scale_ub: Optional[float], 2025-05-07T20:32:37.1385284Z contiguous: bool, 2025-05-07T20:32:37.1385371Z compiled: bool, 2025-05-07T20:32:37.1385534Z ) -> None: 2025-05-07T20:32:37.1385651Z torch.manual_seed(2025) 2025-05-07T20:32:37.1385725Z 2025-05-07T20:32:37.1385890Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1385964Z 2025-05-07T20:32:37.1386054Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1386179Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1387896Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1388759Z 2025-05-07T20:32:37.1388916Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.1388921Z 2025-05-07T20:32:37.1389021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1389239Z self=, 2025-05-07T20:32:37.1389321Z T=2048, 2025-05-07T20:32:37.1389398Z D=7168, 2025-05-07T20:32:37.1389480Z scale_ub=None, 2025-05-07T20:32:37.1389563Z contiguous=True, 2025-05-07T20:32:37.1389644Z compiled=False, 2025-05-07T20:32:37.1389717Z ) 2025-05-07T20:32:37.1389927Z self = 2025-05-07T20:32:37.1390091Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.1390098Z 2025-05-07T20:32:37.1390178Z @given( 2025-05-07T20:32:37.1390292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1390390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1390508Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1390625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1390739Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1390813Z ) 2025-05-07T20:32:37.1391053Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1391146Z def test_silu_mul_quant( 2025-05-07T20:32:37.1391220Z self, 2025-05-07T20:32:37.1391295Z T: int, 2025-05-07T20:32:37.1391373Z D: int, 2025-05-07T20:32:37.1391469Z scale_ub: Optional[float], 2025-05-07T20:32:37.1391555Z contiguous: bool, 2025-05-07T20:32:37.1391640Z compiled: bool, 2025-05-07T20:32:37.1391763Z ) -> None: 2025-05-07T20:32:37.1391855Z torch.manual_seed(2025) 2025-05-07T20:32:37.1391930Z 2025-05-07T20:32:37.1392092Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1392167Z 2025-05-07T20:32:37.1392259Z > x_sign = torch.sign(x) 2025-05-07T20:32:37.1393975Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
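The OutOfMemoryError failures, by contrast, come from example size plus allocator state rather than from the kernel under test. Each Hypothesis example materializes several [T, 2*D] bfloat16 temporaries: for T=16384, D=7168 the initial torch.randn alone is 16384 * 14336 * 2 bytes = 448 MiB, which matches the failed 448.00 MiB request above, and the sign/abs/clamp/multiply steps add comparable temporaries on top. Once the large examples have run, even 40-56 MiB requests fail because the ~22 GiB device is nearly full. A minimal mitigation sketch, assuming it is acceptable to reset the caching allocator at the start of each example (helper name hypothetical):

    import gc
    import torch

    def _release_cuda_memory() -> None:
        # Drop unreachable tensors, then return cached blocks to the driver,
        # so each Hypothesis example starts from a clean allocator state.
        gc.collect()
        torch.cuda.empty_cache()

Because the @given-decorated body runs once per generated example, calling _release_cuda_memory() at the top of test_silu_mul_quant would take effect between examples; unittest's setUp/tearDown would not, since they run only once around the whole example sequence.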
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1394025Z 2025-05-07T20:32:37.1394139Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:37.1394146Z 2025-05-07T20:32:37.1394245Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1394464Z self=, 2025-05-07T20:32:37.1394539Z T=1, 2025-05-07T20:32:37.1394613Z D=7168, 2025-05-07T20:32:37.1394735Z scale_ub=1200.0, 2025-05-07T20:32:37.1394821Z contiguous=True, 2025-05-07T20:32:37.1394902Z compiled=False, 2025-05-07T20:32:37.1394975Z ) 2025-05-07T20:32:37.1395187Z self = 2025-05-07T20:32:37.1395353Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.1395357Z 2025-05-07T20:32:37.1395447Z @given( 2025-05-07T20:32:37.1395575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1395694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1395805Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1395921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1396035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1396106Z ) 2025-05-07T20:32:37.1396350Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1396444Z def test_silu_mul_quant( 2025-05-07T20:32:37.1396520Z self, 2025-05-07T20:32:37.1396639Z T: int, 2025-05-07T20:32:37.1396716Z D: int, 2025-05-07T20:32:37.1396814Z scale_ub: Optional[float], 2025-05-07T20:32:37.1396905Z contiguous: bool, 2025-05-07T20:32:37.1396988Z compiled: bool, 2025-05-07T20:32:37.1397066Z ) -> None: 2025-05-07T20:32:37.1397161Z torch.manual_seed(2025) 2025-05-07T20:32:37.1397232Z 2025-05-07T20:32:37.1397398Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1397472Z 2025-05-07T20:32:37.1397561Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1397683Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1397776Z x = x_sign * x_clamp 2025-05-07T20:32:37.1397856Z x0 = x[:, :D] 2025-05-07T20:32:37.1397938Z x1 = x[:, D:] 2025-05-07T20:32:37.1398009Z 2025-05-07T20:32:37.1398092Z if contiguous: 2025-05-07T20:32:37.1398346Z x0 = x0.contiguous() 2025-05-07T20:32:37.1398440Z x1 = x1.contiguous() 2025-05-07T20:32:37.1398511Z 2025-05-07T20:32:37.1398602Z if scale_ub is not None: 2025-05-07T20:32:37.1398706Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1398841Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1398917Z ) 2025-05-07T20:32:37.1398992Z else: 2025-05-07T20:32:37.1399084Z scale_ub_tensor = None 2025-05-07T20:32:37.1399158Z 2025-05-07T20:32:37.1399284Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1399375Z op = silu_mul_quant 2025-05-07T20:32:37.1399538Z if compiled: 2025-05-07T20:32:37.1399636Z op = torch.compile(op) 2025-05-07T20:32:37.1399741Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1399813Z 2025-05-07T20:32:37.1399903Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1399907Z 2025-05-07T20:32:37.1400009Z moe/activation_test.py:117: 2025-05-07T20:32:37.1400138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1400237Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1400400Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1400892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1400989Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1401347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1401568Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1401905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1402002Z kernel = self.compile( 2025-05-07T20:32:37.1402438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1402611Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1402740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1402747Z 2025-05-07T20:32:37.1402946Z self = 2025-05-07T20:32:37.1403709Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1404204Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d7b22a0>} 2025-05-07T20:32:37.1404995Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1405185Z context = 2025-05-07T20:32:37.1405192Z 2025-05-07T20:32:37.1405367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1405663Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1405773Z module_map=module_map) 2025-05-07T20:32:37.1405933Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1406032Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1406110Z E ^ 2025-05-07T20:32:37.1406460Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1406464Z 2025-05-07T20:32:37.1406872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1406879Z 2025-05-07T20:32:37.1406979Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1407202Z self=, 2025-05-07T20:32:37.1407279Z T=128, 2025-05-07T20:32:37.1407356Z D=5120, 2025-05-07T20:32:37.1407437Z scale_ub=None, 2025-05-07T20:32:37.1407520Z contiguous=True, 2025-05-07T20:32:37.1407603Z compiled=False, 2025-05-07T20:32:37.1407674Z ) 2025-05-07T20:32:37.1407886Z self = 2025-05-07T20:32:37.1408053Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.1408102Z 2025-05-07T20:32:37.1408176Z @given( 2025-05-07T20:32:37.1408294Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1408394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1408509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1408627Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1408743Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1408814Z ) 2025-05-07T20:32:37.1409058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1409192Z def test_silu_mul_quant( 2025-05-07T20:32:37.1409266Z self, 2025-05-07T20:32:37.1409343Z T: int, 2025-05-07T20:32:37.1409417Z D: int, 2025-05-07T20:32:37.1409512Z scale_ub: Optional[float], 2025-05-07T20:32:37.1409603Z contiguous: bool, 2025-05-07T20:32:37.1409687Z compiled: bool, 2025-05-07T20:32:37.1409766Z ) -> None: 2025-05-07T20:32:37.1409863Z torch.manual_seed(2025) 2025-05-07T20:32:37.1409934Z 2025-05-07T20:32:37.1410096Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1410169Z 2025-05-07T20:32:37.1410260Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1410426Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1410516Z x = x_sign * x_clamp 2025-05-07T20:32:37.1410595Z x0 = x[:, :D] 2025-05-07T20:32:37.1410675Z x1 = x[:, D:] 2025-05-07T20:32:37.1410748Z 2025-05-07T20:32:37.1410830Z if contiguous: 2025-05-07T20:32:37.1410923Z x0 = x0.contiguous() 2025-05-07T20:32:37.1411012Z x1 = x1.contiguous() 2025-05-07T20:32:37.1411083Z 2025-05-07T20:32:37.1411175Z if scale_ub is not None: 2025-05-07T20:32:37.1411279Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1411413Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1411494Z ) 2025-05-07T20:32:37.1411568Z else: 2025-05-07T20:32:37.1411662Z scale_ub_tensor = None 2025-05-07T20:32:37.1411733Z 2025-05-07T20:32:37.1411859Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1411953Z op = silu_mul_quant 2025-05-07T20:32:37.1412036Z if compiled: 2025-05-07T20:32:37.1412175Z op = torch.compile(op) 2025-05-07T20:32:37.1412281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1412355Z 2025-05-07T20:32:37.1412444Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1412450Z 2025-05-07T20:32:37.1412547Z moe/activation_test.py:117: 2025-05-07T20:32:37.1412673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1412773Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1412870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1413357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1413457Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1413897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1414122Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1414460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1414553Z kernel = self.compile( 2025-05-07T20:32:37.1414933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1415104Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1415230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1415235Z 2025-05-07T20:32:37.1415437Z self = 2025-05-07T20:32:37.1416243Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1416742Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d7b31a0>} 2025-05-07T20:32:37.1417598Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1417785Z context = 2025-05-07T20:32:37.1417793Z 2025-05-07T20:32:37.1417955Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1418212Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1418320Z module_map=module_map) 2025-05-07T20:32:37.1418478Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1418613Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1418694Z E ^ 2025-05-07T20:32:37.1419042Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1419050Z 2025-05-07T20:32:37.1419456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1419460Z 2025-05-07T20:32:37.1419563Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1419780Z self=, 2025-05-07T20:32:37.1419857Z T=128, 2025-05-07T20:32:37.1419931Z D=7168, 2025-05-07T20:32:37.1420013Z scale_ub=None, 2025-05-07T20:32:37.1420098Z contiguous=True, 2025-05-07T20:32:37.1420182Z compiled=False, 2025-05-07T20:32:37.1420254Z ) 2025-05-07T20:32:37.1420472Z self = 2025-05-07T20:32:37.1420638Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.1420642Z 2025-05-07T20:32:37.1420759Z @given( 2025-05-07T20:32:37.1420876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1420973Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1421092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1421206Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1421318Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1421393Z ) 2025-05-07T20:32:37.1421632Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1421724Z def test_silu_mul_quant( 2025-05-07T20:32:37.1421803Z self, 2025-05-07T20:32:37.1421878Z T: int, 2025-05-07T20:32:37.1421955Z D: int, 2025-05-07T20:32:37.1422051Z scale_ub: Optional[float], 2025-05-07T20:32:37.1422137Z contiguous: bool, 2025-05-07T20:32:37.1422224Z compiled: bool, 2025-05-07T20:32:37.1422302Z ) -> None: 2025-05-07T20:32:37.1422398Z torch.manual_seed(2025) 2025-05-07T20:32:37.1422470Z 2025-05-07T20:32:37.1422634Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1422708Z 2025-05-07T20:32:37.1422800Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1422921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1423008Z x = x_sign * x_clamp 2025-05-07T20:32:37.1423090Z x0 = x[:, :D] 2025-05-07T20:32:37.1423168Z x1 = x[:, D:] 2025-05-07T20:32:37.1423238Z 2025-05-07T20:32:37.1423322Z if contiguous: 2025-05-07T20:32:37.1423411Z x0 = x0.contiguous() 2025-05-07T20:32:37.1423547Z x1 = x1.contiguous() 2025-05-07T20:32:37.1423617Z 2025-05-07T20:32:37.1423706Z if scale_ub is not None: 2025-05-07T20:32:37.1423811Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1423946Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1424021Z ) 2025-05-07T20:32:37.1424101Z else: 2025-05-07T20:32:37.1424192Z scale_ub_tensor = None 2025-05-07T20:32:37.1424262Z 2025-05-07T20:32:37.1424433Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1424520Z op = silu_mul_quant 2025-05-07T20:32:37.1424603Z if compiled: 2025-05-07T20:32:37.1424704Z op = torch.compile(op) 2025-05-07T20:32:37.1424807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1424881Z 2025-05-07T20:32:37.1424970Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1424975Z 2025-05-07T20:32:37.1425071Z moe/activation_test.py:117: 2025-05-07T20:32:37.1425203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1425303Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1425401Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1425935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1426032Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1426389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1426609Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1426942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1427035Z kernel = self.compile( 2025-05-07T20:32:37.1427412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1427585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1427714Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1427718Z 2025-05-07T20:32:37.1427959Z self = 2025-05-07T20:32:37.1428721Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1429217Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779ced4040>} 2025-05-07T20:32:37.1429949Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1430141Z context = 2025-05-07T20:32:37.1430145Z 2025-05-07T20:32:37.1430311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1430572Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1430678Z module_map=module_map) 2025-05-07T20:32:37.1430843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1430940Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1431014Z E ^ 2025-05-07T20:32:37.1431362Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1431367Z 2025-05-07T20:32:37.1431770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1431815Z 2025-05-07T20:32:37.1431916Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1432138Z self=, 2025-05-07T20:32:37.1432213Z T=2048, 2025-05-07T20:32:37.1432293Z D=7168, 2025-05-07T20:32:37.1432375Z scale_ub=1200.0, 2025-05-07T20:32:37.1432459Z contiguous=True, 2025-05-07T20:32:37.1432543Z compiled=False, 2025-05-07T20:32:37.1432614Z ) 2025-05-07T20:32:37.1432826Z self = 2025-05-07T20:32:37.1433038Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.1433043Z 2025-05-07T20:32:37.1433117Z @given( 2025-05-07T20:32:37.1433233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1433333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1433446Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1433565Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1433677Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1433749Z ) 2025-05-07T20:32:37.1433993Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1434124Z def test_silu_mul_quant( 2025-05-07T20:32:37.1434206Z self, 2025-05-07T20:32:37.1434285Z T: int, 2025-05-07T20:32:37.1434360Z D: int, 2025-05-07T20:32:37.1434455Z scale_ub: Optional[float], 2025-05-07T20:32:37.1434548Z contiguous: bool, 2025-05-07T20:32:37.1434633Z compiled: bool, 2025-05-07T20:32:37.1434709Z ) -> None: 2025-05-07T20:32:37.1434806Z torch.manual_seed(2025) 2025-05-07T20:32:37.1434876Z 2025-05-07T20:32:37.1435044Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1436871Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
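Peak memory per example could also be reduced inside the test itself: the preprocessing shown above allocates full-size tensors for sign(x), abs(x), the clamp result, and the final product before slicing. A sketch of an equivalent in-place formulation that keeps a single extra temporary (an illustrative rewrite, not the test's current code; shapes chosen to mirror one of the failing examples):

    import torch

    x = torch.randn([2048, 2 * 7168], device="cuda", dtype=torch.bfloat16)
    s = torch.sign(x)                    # the one remaining temporary
    x.abs_().clamp_(0.01, 2.0).mul_(s)   # equals sign(x) * clamp(|x|, 0.01, 2.0)
    del s
    x0, x1 = x[:, :7168], x[:, 7168:]    # same slicing as the test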
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1436880Z 2025-05-07T20:32:37.1437000Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.1437007Z 2025-05-07T20:32:37.1437106Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1437325Z self=, 2025-05-07T20:32:37.1437402Z T=1, 2025-05-07T20:32:37.1437478Z D=5120, 2025-05-07T20:32:37.1437558Z scale_ub=1200.0, 2025-05-07T20:32:37.1437642Z contiguous=True, 2025-05-07T20:32:37.1437723Z compiled=False, 2025-05-07T20:32:37.1437796Z ) 2025-05-07T20:32:37.1438011Z self = 2025-05-07T20:32:37.1438170Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.1438174Z 2025-05-07T20:32:37.1438250Z @given( 2025-05-07T20:32:37.1438371Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1438469Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1438585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1438701Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1438811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1438886Z ) 2025-05-07T20:32:37.1439124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1439218Z def test_silu_mul_quant( 2025-05-07T20:32:37.1439292Z self, 2025-05-07T20:32:37.1439368Z T: int, 2025-05-07T20:32:37.1439490Z D: int, 2025-05-07T20:32:37.1439587Z scale_ub: Optional[float], 2025-05-07T20:32:37.1439673Z contiguous: bool, 2025-05-07T20:32:37.1439760Z compiled: bool, 2025-05-07T20:32:37.1439835Z ) -> None: 2025-05-07T20:32:37.1439927Z torch.manual_seed(2025) 2025-05-07T20:32:37.1440003Z 2025-05-07T20:32:37.1440169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1440240Z 2025-05-07T20:32:37.1440333Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1440455Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1440584Z x = x_sign * x_clamp 2025-05-07T20:32:37.1440669Z x0 = x[:, :D] 2025-05-07T20:32:37.1440748Z x1 = x[:, D:] 2025-05-07T20:32:37.1440822Z 2025-05-07T20:32:37.1440904Z if contiguous: 2025-05-07T20:32:37.1440993Z x0 = x0.contiguous() 2025-05-07T20:32:37.1441086Z x1 = x1.contiguous() 2025-05-07T20:32:37.1441155Z 2025-05-07T20:32:37.1441247Z if scale_ub is not None: 2025-05-07T20:32:37.1441352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1441487Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1441561Z ) 2025-05-07T20:32:37.1441639Z else: 2025-05-07T20:32:37.1441772Z scale_ub_tensor = None 2025-05-07T20:32:37.1441843Z 2025-05-07T20:32:37.1441976Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1442064Z op = silu_mul_quant 2025-05-07T20:32:37.1442151Z if compiled: 2025-05-07T20:32:37.1442249Z op = torch.compile(op) 2025-05-07T20:32:37.1442352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1442426Z 2025-05-07T20:32:37.1442515Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1442519Z 2025-05-07T20:32:37.1442613Z moe/activation_test.py:117: 2025-05-07T20:32:37.1442744Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1442844Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1442941Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1443432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1443529Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1443932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1444152Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1444488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1444582Z kernel = self.compile( 2025-05-07T20:32:37.1444959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1445131Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1445263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1445267Z 2025-05-07T20:32:37.1445492Z self = 2025-05-07T20:32:37.1446283Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1446780Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779ced5580>} 2025-05-07T20:32:37.1447515Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1447769Z context = 2025-05-07T20:32:37.1447774Z 2025-05-07T20:32:37.1447936Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1448197Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1448305Z module_map=module_map) 2025-05-07T20:32:37.1448467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1448562Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1448679Z E ^ 2025-05-07T20:32:37.1449025Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1449030Z 2025-05-07T20:32:37.1449434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1449438Z 2025-05-07T20:32:37.1449545Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1449768Z self=, 2025-05-07T20:32:37.1449843Z T=2048, 2025-05-07T20:32:37.1449919Z D=5120, 2025-05-07T20:32:37.1449999Z scale_ub=None, 2025-05-07T20:32:37.1450082Z contiguous=True, 2025-05-07T20:32:37.1450205Z compiled=False, 2025-05-07T20:32:37.1450277Z ) 2025-05-07T20:32:37.1450493Z self = 2025-05-07T20:32:37.1450665Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.1450672Z 2025-05-07T20:32:37.1450747Z @given( 2025-05-07T20:32:37.1450863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1450963Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1451075Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1451197Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1451307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1451381Z ) 2025-05-07T20:32:37.1451623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1451715Z def test_silu_mul_quant( 2025-05-07T20:32:37.1451792Z self, 2025-05-07T20:32:37.1451871Z T: int, 2025-05-07T20:32:37.1451948Z D: int, 2025-05-07T20:32:37.1452086Z scale_ub: Optional[float], 2025-05-07T20:32:37.1452176Z contiguous: bool, 2025-05-07T20:32:37.1452259Z compiled: bool, 2025-05-07T20:32:37.1452338Z ) -> None: 2025-05-07T20:32:37.1452433Z torch.manual_seed(2025) 2025-05-07T20:32:37.1452503Z 2025-05-07T20:32:37.1452669Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1452740Z 2025-05-07T20:32:37.1452829Z > x_sign = torch.sign(x) 2025-05-07T20:32:37.1454667Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError

[... ten further Hypothesis examples elided; each failed with torch.OutOfMemoryError at `x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)` (moe/activation_test.py:92) while only 26.44 MiB of the GPU's 22.07 GiB remained free:
      T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=False  -- tried to allocate 320.00 MiB
      T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=False  -- tried to allocate  80.00 MiB
      T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False  -- tried to allocate  40.00 MiB
      T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True   -- tried to allocate 112.00 MiB
      T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False  -- tried to allocate  40.00 MiB
      T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False  -- tried to allocate 112.00 MiB
      T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=True   -- tried to allocate 448.00 MiB
      T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=False  -- tried to allocate 112.00 MiB
      T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=False  -- tried to allocate 448.00 MiB
      T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=False  -- tried to allocate 448.00 MiB
each message ending with: If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) ...]
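Every OOM message above points at the same allocator knob, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which has to be in the environment before the process makes its first CUDA allocation. A minimal sketch of how a rerun could opt in (the variable is the documented PyTorch setting quoted verbatim in the errors; the tensor shape below just mirrors the largest failing example):

    # Minimal sketch: the env var must be set before torch first touches CUDA.
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after setting the variable so the allocator sees it

    x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)
    print(torch.cuda.memory_summary())  # inspect the reserved vs. allocated split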
Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)

[... test body identical to the full listing above ...]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... Triton jit.py/compiler.py frames and CUDAOptions dump identical to the traceback above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)

[... test body identical to the full listing above ...]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

[... test body identical to the full listing above ...]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... Triton jit.py/compiler.py frames and CUDAOptions dump identical to the traceback above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
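Every CompilationError in this run has the same root cause: Triton only lowers the fp8e4nv (FP8 E4M3) dtype on GPUs with compute capability 8.9 or newer, and the A10G in this linux.g5.4xlarge runner reports 8.6, leaving only fp8e4b15 and fp8e5 available, exactly as the ValueError lists. A hedged sketch of a capability gate that could skip these examples up front (the helper and class names are illustrative, not from activation_test.py):

    # Sketch: skip fp8e4nv (E4M3) tests on pre-SM-8.9 GPUs such as the A10G.
    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton requires compute capability >= (8, 9) to lower fp8e4nv.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs SM 8.9+ (Ada/Hopper)")
    class ActivationFp8Gate(unittest.TestCase):  # illustrative name
        ...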
Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)

[... test body identical to the full listing above ...]

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

[... test body identical to the full listing above ...]

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
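Free GPU memory shrinks as the run proceeds, from 26.44 MiB earlier in this stretch to 4.44 MiB here, so each new example fails on an ever-smaller allocation: tensors from earlier examples are evidently still alive. A sketch of a cleanup that could run between Hypothesis examples (the wrapper is illustrative and not part of the test file shown above):

    # Sketch: release per-example CUDA memory so later examples start clean.
    import gc

    import torch

    def run_isolated(example_fn) -> None:
        try:
            example_fn()
        finally:
            gc.collect()              # drop tensors only reachable from dead frames
            torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver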
Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

[... test body identical to the full listing above ...]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:37.1558099Z 2025-05-07T20:32:37.1558343Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:37.1558509Z ================= 1 failed, 1 deselected, 3 warnings in 12.04s ================= 2025-05-07T20:32:38.7671213Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:38.8302555Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:38.8302806Z 2025-05-07T20:32:40.8320823Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:43.0063291Z ============================= test session starts ============================== 2025-05-07T20:32:43.0064275Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:43.0065168Z cachedir: .pytest_cache 2025-05-07T20:32:43.0066449Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:43.0067664Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:43.0068362Z plugins: hypothesis-6.131.14 2025-05-07T20:32:44.5497504Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:44.6454454Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:44.6455253Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:44.6455684Z 2025-05-07T20:32:46.7481747Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7482715Z self=, 2025-05-07T20:32:46.7483225Z T=1, 2025-05-07T20:32:46.7483419Z D=5120, 2025-05-07T20:32:46.7483607Z scale_ub=None, 2025-05-07T20:32:46.7483819Z contiguous=True, 2025-05-07T20:32:46.7484051Z compiled=True, 2025-05-07T20:32:46.7484255Z ) 2025-05-07T20:32:46.7484579Z self = 2025-05-07T20:32:46.7485061Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.7485322Z 2025-05-07T20:32:46.7485400Z @given( 2025-05-07T20:32:46.7485638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7485952Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7486265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7486585Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7486913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7487498Z ) 2025-05-07T20:32:46.7487840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7488328Z def test_silu_mul_quant( 2025-05-07T20:32:46.7488575Z self, 2025-05-07T20:32:46.7488768Z T: int, 2025-05-07T20:32:46.7488969Z D: int, 2025-05-07T20:32:46.7489190Z scale_ub: Optional[float], 2025-05-07T20:32:46.7489456Z contiguous: bool, 2025-05-07T20:32:46.7489701Z compiled: bool, 2025-05-07T20:32:46.7490034Z ) -> None: 2025-05-07T20:32:46.7490243Z torch.manual_seed(2025) 2025-05-07T20:32:46.7490489Z 2025-05-07T20:32:46.7490764Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7491095Z 2025-05-07T20:32:46.7491292Z x_sign = torch.sign(x) 2025-05-07T20:32:46.7491582Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:46.7491900Z x = x_sign * x_clamp 2025-05-07T20:32:46.7492139Z x0 = x[:, :D] 2025-05-07T20:32:46.7492356Z x1 = x[:, D:] 2025-05-07T20:32:46.7492568Z 2025-05-07T20:32:46.7492752Z if contiguous: 2025-05-07T20:32:46.7492986Z x0 = x0.contiguous() 2025-05-07T20:32:46.7493249Z x1 = x1.contiguous() 2025-05-07T20:32:46.7493568Z 2025-05-07T20:32:46.7493894Z if scale_ub is not None: 2025-05-07T20:32:46.7494168Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.7494495Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.7494811Z ) 2025-05-07T20:32:46.7495008Z else: 2025-05-07T20:32:46.7495214Z scale_ub_tensor = None 2025-05-07T20:32:46.7495478Z 2025-05-07T20:32:46.7495733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7502315Z op = silu_mul_quant 2025-05-07T20:32:46.7502592Z if compiled: 2025-05-07T20:32:46.7502852Z op = torch.compile(op) 2025-05-07T20:32:46.7503159Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7503441Z 2025-05-07T20:32:46.7503642Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.7503936Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.7504230Z 2025-05-07T20:32:46.7504482Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7504957Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.7505249Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.7505563Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.7505932Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7506238Z 2025-05-07T20:32:46.7506442Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:46.7506635Z 2025-05-07T20:32:46.7506743Z moe/activation_test.py:126: 2025-05-07T20:32:46.7507036Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7507378Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.7507712Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7508498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:46.7509241Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.7509788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.7510466Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.7511148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.7511861Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.7512583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.7513292Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.7513887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.7514411Z fn() 2025-05-07T20:32:46.7514926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.7515509Z self.fn.run( 2025-05-07T20:32:46.7515969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.7516575Z kernel = self.compile( 2025-05-07T20:32:46.7517121Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.7517765Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.7518171Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7518414Z 2025-05-07T20:32:46.7518622Z self = 2025-05-07T20:32:46.7519760Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.7521129Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c614a700>} 2025-05-07T20:32:46.7522458Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.7523470Z context = 2025-05-07T20:32:46.7523763Z 2025-05-07T20:32:46.7523931Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.7524459Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.7524921Z module_map=module_map) 2025-05-07T20:32:46.7525290Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.7525698Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.7525972Z E ^ 2025-05-07T20:32:46.7526426Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7526876Z 2025-05-07T20:32:46.7527285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.7527800Z 2025-05-07T20:32:46.7527907Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7528316Z self=, 2025-05-07T20:32:46.7528710Z T=2048, 2025-05-07T20:32:46.7528903Z D=5120, 2025-05-07T20:32:46.7529105Z scale_ub=1200.0, 2025-05-07T20:32:46.7529327Z contiguous=True, 2025-05-07T20:32:46.7529555Z compiled=False, 2025-05-07T20:32:46.7529767Z ) 2025-05-07T20:32:46.7530084Z self = 2025-05-07T20:32:46.7530581Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.7530851Z 2025-05-07T20:32:46.7530936Z @given( 2025-05-07T20:32:46.7531168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7531476Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7531779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7532109Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7532430Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7532718Z ) 2025-05-07T20:32:46.7533071Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7533561Z def test_silu_mul_quant( 2025-05-07T20:32:46.7533910Z self, 2025-05-07T20:32:46.7534111Z T: int, 2025-05-07T20:32:46.7534305Z D: int, 2025-05-07T20:32:46.7534524Z scale_ub: Optional[float], 2025-05-07T20:32:46.7534797Z contiguous: bool, 2025-05-07T20:32:46.7535036Z compiled: bool, 2025-05-07T20:32:46.7535262Z ) -> None: 2025-05-07T20:32:46.7535477Z torch.manual_seed(2025) 2025-05-07T20:32:46.7535716Z 2025-05-07T20:32:46.7536039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7536378Z 2025-05-07T20:32:46.7536572Z x_sign = torch.sign(x) 2025-05-07T20:32:46.7536857Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.7537169Z x = x_sign * x_clamp 2025-05-07T20:32:46.7537413Z x0 = x[:, :D] 
2025-05-07T20:32:46.7537629Z x1 = x[:, D:] 2025-05-07T20:32:46.7537837Z 2025-05-07T20:32:46.7538030Z if contiguous: 2025-05-07T20:32:46.7538263Z x0 = x0.contiguous() 2025-05-07T20:32:46.7538523Z x1 = x1.contiguous() 2025-05-07T20:32:46.7538765Z 2025-05-07T20:32:46.7538954Z if scale_ub is not None: 2025-05-07T20:32:46.7539278Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.7539617Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.7539920Z ) 2025-05-07T20:32:46.7540115Z else: 2025-05-07T20:32:46.7540328Z scale_ub_tensor = None 2025-05-07T20:32:46.7540578Z 2025-05-07T20:32:46.7540811Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7541126Z op = silu_mul_quant 2025-05-07T20:32:46.7541384Z if compiled: 2025-05-07T20:32:46.7541627Z op = torch.compile(op) 2025-05-07T20:32:46.7541926Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7542203Z 2025-05-07T20:32:46.7542393Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.7542565Z 2025-05-07T20:32:46.7542663Z moe/activation_test.py:117: 2025-05-07T20:32:46.7542960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7543284Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.7543571Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7544307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.7544994Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.7545525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.7546197Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.7546858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.7547383Z kernel = self.compile( 2025-05-07T20:32:46.7547924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.7548573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.7548975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7549203Z 2025-05-07T20:32:46.7549410Z self = 2025-05-07T20:32:46.7550473Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.7551828Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c5ffa020>} 2025-05-07T20:32:46.7553151Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.7554232Z context = 2025-05-07T20:32:46.7554531Z 2025-05-07T20:32:46.7554701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.7555221Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.7555728Z module_map=module_map) 2025-05-07T20:32:46.7556088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.7556440Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.7556698Z E ^ 2025-05-07T20:32:46.7557153Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7557591Z 2025-05-07T20:32:46.7557997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4002478Z 2025-05-07T20:32:47.4003156Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4003789Z self=, 2025-05-07T20:32:47.4004673Z T=2048, 2025-05-07T20:32:47.4004930Z D=5120, 2025-05-07T20:32:47.4005133Z scale_ub=1200.0, 2025-05-07T20:32:47.4005353Z contiguous=True, 2025-05-07T20:32:47.4005578Z compiled=True, 2025-05-07T20:32:47.4005790Z ) 2025-05-07T20:32:47.4006113Z self = 2025-05-07T20:32:47.4006610Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.4006886Z 2025-05-07T20:32:47.4006966Z @given( 2025-05-07T20:32:47.4007199Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4007506Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4007821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4008151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4008469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4008754Z ) 2025-05-07T20:32:47.4009104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4009635Z def test_silu_mul_quant( 2025-05-07T20:32:47.4009884Z self, 2025-05-07T20:32:47.4010080Z T: int, 2025-05-07T20:32:47.4010274Z D: int, 2025-05-07T20:32:47.4010491Z scale_ub: Optional[float], 2025-05-07T20:32:47.4010760Z contiguous: bool, 2025-05-07T20:32:47.4011003Z compiled: bool, 2025-05-07T20:32:47.4011228Z ) -> None: 2025-05-07T20:32:47.4011447Z torch.manual_seed(2025) 2025-05-07T20:32:47.4011690Z 2025-05-07T20:32:47.4011956Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4012298Z 2025-05-07T20:32:47.4012492Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4012773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4013082Z x = x_sign * x_clamp 2025-05-07T20:32:47.4013319Z x0 = x[:, :D] 2025-05-07T20:32:47.4013524Z x1 = x[:, D:] 2025-05-07T20:32:47.4013877Z 2025-05-07T20:32:47.4014061Z if contiguous: 2025-05-07T20:32:47.4014284Z x0 = x0.contiguous() 2025-05-07T20:32:47.4014543Z x1 = x1.contiguous() 2025-05-07T20:32:47.4014782Z 2025-05-07T20:32:47.4014965Z if scale_ub is not None: 2025-05-07T20:32:47.4015234Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4015564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4015870Z ) 2025-05-07T20:32:47.4016056Z else: 2025-05-07T20:32:47.4016268Z scale_ub_tensor = None 2025-05-07T20:32:47.4016519Z 2025-05-07T20:32:47.4016743Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4017154Z op = silu_mul_quant 2025-05-07T20:32:47.4017404Z if compiled: 2025-05-07T20:32:47.4017681Z op = torch.compile(op) 2025-05-07T20:32:47.4017976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4018254Z 2025-05-07T20:32:47.4018451Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.4018737Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.4019022Z 2025-05-07T20:32:47.4019256Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4019681Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.4019966Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.4020279Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.4020633Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.4020953Z 2025-05-07T20:32:47.4021154Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:47.4021356Z 2025-05-07T20:32:47.4021460Z moe/activation_test.py:126: 2025-05-07T20:32:47.4021754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4022082Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.4022451Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.4023239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.4023978Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.4024518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4025198Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4025874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.4026578Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.4027297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.4027922Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.4028571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.4029079Z fn() 2025-05-07T20:32:47.4029580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.4030157Z self.fn.run( 2025-05-07T20:32:47.4030612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4031135Z kernel = self.compile( 2025-05-07T20:32:47.4031671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4032316Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4032705Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4032937Z 2025-05-07T20:32:47.4033145Z self = 2025-05-07T20:32:47.4034217Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4035581Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c4eeaac0>} 2025-05-07T20:32:47.4036899Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4037952Z context = 2025-05-07T20:32:47.4038241Z 2025-05-07T20:32:47.4038406Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4038969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4039429Z module_map=module_map) 2025-05-07T20:32:47.4039795Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4040200Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.4040469Z E ^ 2025-05-07T20:32:47.4040918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4041366Z 2025-05-07T20:32:47.4041774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4042275Z 2025-05-07T20:32:47.4042392Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4042802Z self=, 2025-05-07T20:32:47.4043193Z T=16384, 2025-05-07T20:32:47.4043392Z D=7168, 2025-05-07T20:32:47.4043587Z scale_ub=1200.0, 2025-05-07T20:32:47.4043853Z contiguous=False, 2025-05-07T20:32:47.4044080Z compiled=False, 2025-05-07T20:32:47.4044291Z ) 2025-05-07T20:32:47.4044598Z self = 2025-05-07T20:32:47.4045093Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.4045363Z 2025-05-07T20:32:47.4045445Z @given( 2025-05-07T20:32:47.4045668Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4045984Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4046285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4046608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4046931Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4047218Z ) 2025-05-07T20:32:47.4047561Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4047990Z def test_silu_mul_quant( 2025-05-07T20:32:47.4048239Z self, 2025-05-07T20:32:47.4048435Z T: int, 2025-05-07T20:32:47.4048677Z D: int, 2025-05-07T20:32:47.4048902Z scale_ub: Optional[float], 2025-05-07T20:32:47.4049169Z contiguous: bool, 2025-05-07T20:32:47.4049402Z compiled: bool, 2025-05-07T20:32:47.4049627Z ) -> None: 2025-05-07T20:32:47.4049841Z torch.manual_seed(2025) 2025-05-07T20:32:47.4050074Z 2025-05-07T20:32:47.4050342Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4050676Z 2025-05-07T20:32:47.4050868Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4051152Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4051461Z x = x_sign * x_clamp 2025-05-07T20:32:47.4051701Z x0 = x[:, :D] 2025-05-07T20:32:47.4051913Z x1 = x[:, D:] 2025-05-07T20:32:47.4052120Z 2025-05-07T20:32:47.4052308Z if contiguous: 2025-05-07T20:32:47.4052531Z x0 = x0.contiguous() 2025-05-07T20:32:47.4052790Z x1 = x1.contiguous() 2025-05-07T20:32:47.4053026Z 2025-05-07T20:32:47.4053214Z if scale_ub is not None: 2025-05-07T20:32:47.4053481Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4053929Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4054268Z ) 2025-05-07T20:32:47.4054471Z else: 2025-05-07T20:32:47.4054692Z scale_ub_tensor = None 2025-05-07T20:32:47.4054959Z 2025-05-07T20:32:47.4055207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4055555Z op = silu_mul_quant 2025-05-07T20:32:47.4055823Z if compiled: 2025-05-07T20:32:47.4056126Z op = torch.compile(op) 2025-05-07T20:32:47.4056420Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4056696Z 2025-05-07T20:32:47.4056882Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4057048Z 2025-05-07T20:32:47.4057142Z moe/activation_test.py:117: 2025-05-07T20:32:47.4057439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4057766Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4058045Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4058765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:47.4059439Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4059971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4060643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4061302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4061822Z kernel = self.compile( 2025-05-07T20:32:47.4062401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4063055Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4063450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4063679Z 2025-05-07T20:32:47.4063885Z self = 2025-05-07T20:32:47.4064941Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4066291Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c5ffa980>} 2025-05-07T20:32:47.4067613Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4068705Z context = 2025-05-07T20:32:47.4068992Z 2025-05-07T20:32:47.4069156Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4069669Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4070126Z module_map=module_map) 2025-05-07T20:32:47.4070482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4070831Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4071091Z E ^ 2025-05-07T20:32:47.4071542Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4071986Z 2025-05-07T20:32:47.4072393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.0958909Z 2025-05-07T20:32:48.0959895Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.0960447Z self=, 2025-05-07T20:32:48.0960855Z T=1, 2025-05-07T20:32:48.0961054Z D=7168, 2025-05-07T20:32:48.0961244Z scale_ub=None, 2025-05-07T20:32:48.0961456Z contiguous=True, 2025-05-07T20:32:48.0961681Z compiled=True, 2025-05-07T20:32:48.0961884Z ) 2025-05-07T20:32:48.0962205Z self = 2025-05-07T20:32:48.0962693Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:48.0962954Z 2025-05-07T20:32:48.0963338Z @given( 2025-05-07T20:32:48.0963574Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.0963893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.0964194Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.0964531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.0964865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.0965154Z ) 2025-05-07T20:32:48.0965496Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.0966039Z def test_silu_mul_quant( 2025-05-07T20:32:48.0966278Z self, 2025-05-07T20:32:48.0966467Z T: int, 2025-05-07T20:32:48.0966666Z D: int, 2025-05-07T20:32:48.0966885Z scale_ub: Optional[float], 2025-05-07T20:32:48.0967148Z contiguous: bool, 2025-05-07T20:32:48.0967387Z compiled: bool, 2025-05-07T20:32:48.0967613Z ) -> None: 2025-05-07T20:32:48.0967829Z torch.manual_seed(2025) 2025-05-07T20:32:48.0968075Z 2025-05-07T20:32:48.0968352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.0968692Z 2025-05-07T20:32:48.0968886Z x_sign = torch.sign(x) 2025-05-07T20:32:48.0969296Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.0969600Z x = x_sign * x_clamp 2025-05-07T20:32:48.0969848Z x0 = x[:, :D] 2025-05-07T20:32:48.0970068Z x1 = x[:, D:] 2025-05-07T20:32:48.0970270Z 2025-05-07T20:32:48.0970459Z if contiguous: 2025-05-07T20:32:48.0970698Z x0 = x0.contiguous() 2025-05-07T20:32:48.0970956Z x1 = x1.contiguous() 2025-05-07T20:32:48.0971190Z 2025-05-07T20:32:48.0971384Z if scale_ub is not None: 2025-05-07T20:32:48.0971662Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.0971997Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.0972311Z ) 2025-05-07T20:32:48.0972508Z else: 2025-05-07T20:32:48.0972711Z scale_ub_tensor = None 2025-05-07T20:32:48.0972961Z 2025-05-07T20:32:48.0973191Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.0973501Z op = silu_mul_quant 2025-05-07T20:32:48.0973864Z if compiled: 2025-05-07T20:32:48.0974117Z op = torch.compile(op) 2025-05-07T20:32:48.0974495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.0974770Z 2025-05-07T20:32:48.0974962Z y_fp8, y_scale = fn() 2025-05-07T20:32:48.0975244Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:48.0975533Z 2025-05-07T20:32:48.0975770Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.0976121Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:48.0982355Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:48.0982679Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:48.0983048Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.0983364Z 2025-05-07T20:32:48.0983571Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:48.0983768Z 2025-05-07T20:32:48.0983869Z moe/activation_test.py:126: 2025-05-07T20:32:48.0984174Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.0984514Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:48.0984832Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.0985615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:48.0986363Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:48.0986908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.0987574Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.0988325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:48.0989036Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:48.0989755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:48.0990374Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:48.0990968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:48.0991525Z fn() 2025-05-07T20:32:48.0992018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:48.0992591Z self.fn.run( 2025-05-07T20:32:48.0993054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.0993581Z kernel = self.compile( 2025-05-07T20:32:48.0994112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.0994768Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.0995206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.0995435Z 2025-05-07T20:32:48.0995639Z self = 2025-05-07T20:32:48.0996708Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.1001471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c519e5c0>} 2025-05-07T20:32:48.1002805Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.1003813Z context = 2025-05-07T20:32:48.1004105Z 2025-05-07T20:32:48.1004367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.1004877Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.1005342Z module_map=module_map) 2025-05-07T20:32:48.1005708Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.1006055Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:48.1006322Z E ^ 2025-05-07T20:32:48.1006785Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.1007228Z 2025-05-07T20:32:48.1007647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.1008153Z 2025-05-07T20:32:48.1008255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.1008714Z self=, 2025-05-07T20:32:48.1009116Z T=4096, 2025-05-07T20:32:48.1009301Z D=5120, 2025-05-07T20:32:48.1009500Z scale_ub=None, 2025-05-07T20:32:48.1009714Z contiguous=False, 2025-05-07T20:32:48.1009942Z compiled=False, 2025-05-07T20:32:48.1010140Z ) 2025-05-07T20:32:48.1010453Z self = 2025-05-07T20:32:48.1010939Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:48.1011203Z 2025-05-07T20:32:48.1011284Z @given( 2025-05-07T20:32:48.1011504Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.1011813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.1012191Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.1012507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.1012833Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.1013116Z ) 2025-05-07T20:32:48.1013457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.1013974Z def test_silu_mul_quant( 2025-05-07T20:32:48.1014211Z self, 2025-05-07T20:32:48.1014395Z T: int, 2025-05-07T20:32:48.1014655Z D: int, 2025-05-07T20:32:48.1014873Z scale_ub: Optional[float], 2025-05-07T20:32:48.1015136Z contiguous: bool, 2025-05-07T20:32:48.1015369Z compiled: bool, 2025-05-07T20:32:48.1015589Z ) -> None: 2025-05-07T20:32:48.1015804Z torch.manual_seed(2025) 2025-05-07T20:32:48.1016031Z 2025-05-07T20:32:48.1016299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.1016637Z 2025-05-07T20:32:48.1016820Z x_sign = torch.sign(x) 2025-05-07T20:32:48.1017104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.1017414Z x = x_sign * x_clamp 2025-05-07T20:32:48.1017646Z x0 = x[:, :D] 2025-05-07T20:32:48.1017926Z x1 = x[:, D:] 2025-05-07T20:32:48.1018132Z 2025-05-07T20:32:48.1018309Z if contiguous: 2025-05-07T20:32:48.1018541Z x0 = x0.contiguous() 2025-05-07T20:32:48.1018794Z x1 = x1.contiguous() 2025-05-07T20:32:48.1019024Z 2025-05-07T20:32:48.1019212Z if scale_ub is not None: 2025-05-07T20:32:48.1019477Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.1019798Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.1020102Z ) 2025-05-07T20:32:48.1020294Z else: 2025-05-07T20:32:48.1020495Z scale_ub_tensor = None 2025-05-07T20:32:48.1020744Z 2025-05-07T20:32:48.1020975Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.1021287Z op = silu_mul_quant 2025-05-07T20:32:48.1021522Z if compiled: 2025-05-07T20:32:48.1021766Z op = torch.compile(op) 2025-05-07T20:32:48.1022060Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.1022328Z 2025-05-07T20:32:48.1022520Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.1022726Z 2025-05-07T20:32:48.1022831Z moe/activation_test.py:117: 2025-05-07T20:32:48.1023114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.1023443Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.1023722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.1024405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.1025081Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.1025613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.1026287Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.1026931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.1027455Z kernel = self.compile( 2025-05-07T20:32:48.1027990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.1028634Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.1029037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.1029268Z 2025-05-07T20:32:48.1029470Z self = 2025-05-07T20:32:48.1030528Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.1031925Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c51be3e0>} 2025-05-07T20:32:48.1033245Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.1034279Z context = 2025-05-07T20:32:48.1034566Z 2025-05-07T20:32:48.1034728Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.1035235Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.1035696Z module_map=module_map) 2025-05-07T20:32:48.1036054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.1036397Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.1036653Z E ^ 2025-05-07T20:32:48.1037099Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.1037586Z 2025-05-07T20:32:48.1037993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.7935789Z 2025-05-07T20:32:48.7936145Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.7936585Z self=, 2025-05-07T20:32:48.7937030Z T=4096, 2025-05-07T20:32:48.7937226Z D=7168, 2025-05-07T20:32:48.7937433Z scale_ub=None, 2025-05-07T20:32:48.7937659Z contiguous=False, 2025-05-07T20:32:48.7937895Z compiled=False, 2025-05-07T20:32:48.7938104Z ) 2025-05-07T20:32:48.7938425Z self = 2025-05-07T20:32:48.7938927Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:48.7939198Z 2025-05-07T20:32:48.7939278Z @given( 2025-05-07T20:32:48.7939513Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.7939831Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.7940251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.7940587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.7940921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.7941204Z ) 2025-05-07T20:32:48.7941551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.7941996Z def test_silu_mul_quant( 2025-05-07T20:32:48.7942239Z self, 2025-05-07T20:32:48.7942431Z T: int, 2025-05-07T20:32:48.7942632Z D: int, 2025-05-07T20:32:48.7942855Z scale_ub: Optional[float], 2025-05-07T20:32:48.7943128Z contiguous: bool, 2025-05-07T20:32:48.7943370Z compiled: bool, 2025-05-07T20:32:48.7943601Z ) -> None: 2025-05-07T20:32:48.7943815Z torch.manual_seed(2025) 2025-05-07T20:32:48.7944060Z 2025-05-07T20:32:48.7944336Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.7944677Z 2025-05-07T20:32:48.7944878Z x_sign = torch.sign(x) 2025-05-07T20:32:48.7945171Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.7945476Z x = x_sign * x_clamp 2025-05-07T20:32:48.7945721Z x0 = x[:, :D] 2025-05-07T20:32:48.7945945Z x1 = x[:, D:] 2025-05-07T20:32:48.7946151Z 2025-05-07T20:32:48.7946343Z if contiguous: 2025-05-07T20:32:48.7946580Z x0 = x0.contiguous() 2025-05-07T20:32:48.7946842Z x1 = x1.contiguous() 2025-05-07T20:32:48.7947079Z 2025-05-07T20:32:48.7947278Z if scale_ub is not None: 2025-05-07T20:32:48.7947558Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.7947997Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.7948312Z ) 2025-05-07T20:32:48.7948533Z else: 2025-05-07T20:32:48.7948782Z scale_ub_tensor = None 2025-05-07T20:32:48.7949044Z 2025-05-07T20:32:48.7949286Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.7949599Z op = silu_mul_quant 2025-05-07T20:32:48.7949858Z if compiled: 2025-05-07T20:32:48.7950115Z op = torch.compile(op) 2025-05-07T20:32:48.7950477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.7950758Z 2025-05-07T20:32:48.7950960Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.7951124Z 2025-05-07T20:32:48.7951225Z moe/activation_test.py:117: 2025-05-07T20:32:48.7951526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.7951862Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.7952151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.7952832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.7953516Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.7954117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.7954788Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.7955459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.7955997Z kernel = self.compile( 2025-05-07T20:32:48.7956539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.7957194Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.7957588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.7957824Z 2025-05-07T20:32:48.7958032Z self = 2025-05-07T20:32:48.7959206Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.7960568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c51bede0>} 2025-05-07T20:32:48.7961895Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.7962897Z context = 2025-05-07T20:32:48.7963191Z 2025-05-07T20:32:48.7963355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.7963871Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.7964330Z module_map=module_map) 2025-05-07T20:32:48.7964706Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.7965062Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.7965324Z E ^ 2025-05-07T20:32:48.7965780Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.7966231Z 2025-05-07T20:32:48.7966639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.7967143Z 2025-05-07T20:32:48.7967255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.7967666Z self=, 2025-05-07T20:32:48.7968106Z T=128, 2025-05-07T20:32:48.7968291Z D=7168, 2025-05-07T20:32:48.7968490Z scale_ub=None, 2025-05-07T20:32:48.7968701Z contiguous=False, 2025-05-07T20:32:48.7968923Z compiled=True, 2025-05-07T20:32:48.7969126Z ) 2025-05-07T20:32:48.7969440Z self = 2025-05-07T20:32:48.7969930Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:48.7970194Z 2025-05-07T20:32:48.7970274Z @given( 2025-05-07T20:32:48.7970552Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.7970868Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.7971176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.7971509Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.7971832Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.7972118Z ) 2025-05-07T20:32:48.7972467Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.7972902Z def test_silu_mul_quant( 2025-05-07T20:32:48.7973142Z self, 2025-05-07T20:32:48.7973365Z T: int, 2025-05-07T20:32:48.7973565Z D: int, 2025-05-07T20:32:48.7973925Z scale_ub: Optional[float], 2025-05-07T20:32:48.7974200Z contiguous: bool, 2025-05-07T20:32:48.7974441Z compiled: bool, 2025-05-07T20:32:48.7974660Z ) -> None: 2025-05-07T20:32:48.7974873Z torch.manual_seed(2025) 2025-05-07T20:32:48.7975115Z 2025-05-07T20:32:48.7975380Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.7975724Z 2025-05-07T20:32:48.7975918Z x_sign = torch.sign(x) 2025-05-07T20:32:48.7976205Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.7976514Z x = x_sign * x_clamp 2025-05-07T20:32:48.7976753Z x0 = x[:, :D] 2025-05-07T20:32:48.7976973Z x1 = x[:, D:] 2025-05-07T20:32:48.7977182Z 2025-05-07T20:32:48.7977370Z if contiguous: 2025-05-07T20:32:48.7977602Z x0 = x0.contiguous() 2025-05-07T20:32:48.7977853Z x1 = x1.contiguous() 2025-05-07T20:32:48.7978094Z 2025-05-07T20:32:48.7978288Z if scale_ub is not None: 2025-05-07T20:32:48.7978558Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.7978988Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.7979307Z ) 2025-05-07T20:32:48.7979500Z else: 2025-05-07T20:32:48.7979718Z scale_ub_tensor = None 2025-05-07T20:32:48.7979970Z 2025-05-07T20:32:48.7980195Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.7980513Z op = silu_mul_quant 2025-05-07T20:32:48.7980764Z if compiled: 2025-05-07T20:32:48.7981008Z op = torch.compile(op) 2025-05-07T20:32:48.7981304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.7981580Z 2025-05-07T20:32:48.7981772Z y_fp8, y_scale = fn() 2025-05-07T20:32:48.7982050Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:48.7982344Z 2025-05-07T20:32:48.7982577Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.7982909Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:48.7983207Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:48.7983518Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:48.7983869Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.7984180Z 2025-05-07T20:32:48.7984383Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:48.7984574Z 2025-05-07T20:32:48.7984673Z moe/activation_test.py:126: 2025-05-07T20:32:48.7984970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.7985309Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:48.7985635Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.7986454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:48.7987194Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:48.7987743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.7988411Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.7989130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:48.7989846Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:48.7990561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:48.7991193Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:48.7991791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:48.7992311Z fn() 2025-05-07T20:32:48.7992862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:48.7993435Z self.fn.run( 2025-05-07T20:32:48.7993902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.7994427Z kernel = self.compile( 2025-05-07T20:32:48.7994959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.7995605Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.7996002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.7996229Z 2025-05-07T20:32:48.7996441Z self = 2025-05-07T20:32:48.7997506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.7999122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c4683a60>} 2025-05-07T20:32:48.8000447Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.8001460Z context = 2025-05-07T20:32:48.8001745Z 2025-05-07T20:32:48.8001917Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.8002424Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.8002889Z module_map=module_map) 2025-05-07T20:32:48.8003254Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.8003603Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:48.8003878Z E ^ 2025-05-07T20:32:48.8004340Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.8004778Z 2025-05-07T20:32:48.8005195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.0371011Z 2025-05-07T20:32:49.0371298Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.0371734Z self=, 2025-05-07T20:32:49.0372163Z T=128, 2025-05-07T20:32:49.0372355Z D=7168, 2025-05-07T20:32:49.0372560Z scale_ub=None, 2025-05-07T20:32:49.0372929Z contiguous=False, 2025-05-07T20:32:49.0373163Z compiled=False, 2025-05-07T20:32:49.0373373Z ) 2025-05-07T20:32:49.0373780Z self = 2025-05-07T20:32:49.0374267Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:49.0374540Z 2025-05-07T20:32:49.0374624Z @given( 2025-05-07T20:32:49.0374862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.0375175Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.0375551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.0375884Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.0376212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.0376494Z ) 2025-05-07T20:32:49.0376841Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.0377284Z def test_silu_mul_quant( 2025-05-07T20:32:49.0377524Z self, 2025-05-07T20:32:49.0377721Z T: int, 2025-05-07T20:32:49.0377926Z D: int, 2025-05-07T20:32:49.0378140Z scale_ub: Optional[float], 2025-05-07T20:32:49.0378415Z contiguous: bool, 2025-05-07T20:32:49.0378654Z compiled: bool, 2025-05-07T20:32:49.0378941Z ) -> None: 2025-05-07T20:32:49.0379162Z torch.manual_seed(2025) 2025-05-07T20:32:49.0379405Z 2025-05-07T20:32:49.0379677Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.0380013Z 2025-05-07T20:32:49.0380206Z x_sign = torch.sign(x) 2025-05-07T20:32:49.0380501Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.0380803Z x = x_sign * x_clamp 2025-05-07T20:32:49.0381043Z x0 = x[:, :D] 2025-05-07T20:32:49.0381256Z x1 = x[:, D:] 2025-05-07T20:32:49.0381462Z 2025-05-07T20:32:49.0381651Z if contiguous: 2025-05-07T20:32:49.0381884Z x0 = x0.contiguous() 2025-05-07T20:32:49.0382141Z x1 = x1.contiguous() 2025-05-07T20:32:49.0382382Z 2025-05-07T20:32:49.0382578Z if scale_ub is not None: 2025-05-07T20:32:49.0382846Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.0383186Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.0383501Z ) 2025-05-07T20:32:49.0383688Z else: 2025-05-07T20:32:49.0384043Z scale_ub_tensor = None 2025-05-07T20:32:49.0384295Z 2025-05-07T20:32:49.0384527Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.0384847Z op = silu_mul_quant 2025-05-07T20:32:49.0385094Z if compiled: 2025-05-07T20:32:49.0385344Z op = torch.compile(op) 2025-05-07T20:32:49.0385641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0385914Z 2025-05-07T20:32:49.0386107Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.0386273Z 2025-05-07T20:32:49.0386372Z moe/activation_test.py:117: 2025-05-07T20:32:49.0386672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0386998Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.0387286Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0387982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.0388660Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.0389204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.0389882Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.0390542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.0392415Z kernel = self.compile( 2025-05-07T20:32:49.0392955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.0394127Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.0394518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0394755Z 2025-05-07T20:32:49.0400933Z self = 2025-05-07T20:32:49.0402026Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.0403504Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c4105e40>} 2025-05-07T20:32:49.0404833Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.0405853Z context = 2025-05-07T20:32:49.0406149Z 2025-05-07T20:32:49.0406317Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.0406907Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.0407371Z module_map=module_map) 2025-05-07T20:32:49.0407745Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.0408112Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.0408380Z E ^ 2025-05-07T20:32:49.0408845Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.0409297Z 2025-05-07T20:32:49.0409709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.0410220Z 2025-05-07T20:32:49.0410332Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.0410738Z self=, 2025-05-07T20:32:49.0411143Z T=4096, 2025-05-07T20:32:49.0411339Z D=5120, 2025-05-07T20:32:49.0411540Z scale_ub=1200.0, 2025-05-07T20:32:49.0411764Z contiguous=True, 2025-05-07T20:32:49.0412051Z compiled=False, 2025-05-07T20:32:49.0412264Z ) 2025-05-07T20:32:49.0412584Z self = 2025-05-07T20:32:49.0413084Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:49.0413357Z 2025-05-07T20:32:49.0413447Z @given( 2025-05-07T20:32:49.0413747Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.0414100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.0414448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.0414816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.0415151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.0415447Z ) 2025-05-07T20:32:49.0415802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.0416245Z def test_silu_mul_quant( 2025-05-07T20:32:49.0416493Z self, 2025-05-07T20:32:49.0416696Z T: int, 2025-05-07T20:32:49.0416892Z D: int, 2025-05-07T20:32:49.0417115Z scale_ub: Optional[float], 2025-05-07T20:32:49.0417396Z contiguous: bool, 2025-05-07T20:32:49.0417634Z compiled: bool, 2025-05-07T20:32:49.0417866Z ) -> None: 2025-05-07T20:32:49.0418088Z torch.manual_seed(2025) 2025-05-07T20:32:49.0418327Z 2025-05-07T20:32:49.0418602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.0418955Z 2025-05-07T20:32:49.0419150Z x_sign = torch.sign(x) 2025-05-07T20:32:49.0419449Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.0419838Z x = x_sign * x_clamp 2025-05-07T20:32:49.0420079Z x0 = x[:, :D] 2025-05-07T20:32:49.0420297Z x1 = x[:, D:] 2025-05-07T20:32:49.0420507Z 2025-05-07T20:32:49.0420696Z if contiguous: 2025-05-07T20:32:49.0420925Z x0 = x0.contiguous() 2025-05-07T20:32:49.0421189Z x1 = x1.contiguous() 2025-05-07T20:32:49.0421438Z 2025-05-07T20:32:49.0421632Z if scale_ub is not None: 2025-05-07T20:32:49.0421911Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.0422309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.0422619Z ) 2025-05-07T20:32:49.0422811Z else: 2025-05-07T20:32:49.0423023Z scale_ub_tensor = None 2025-05-07T20:32:49.0423279Z 2025-05-07T20:32:49.0423504Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.0423820Z op = silu_mul_quant 2025-05-07T20:32:49.0424072Z if compiled: 2025-05-07T20:32:49.0424322Z op = torch.compile(op) 2025-05-07T20:32:49.0424621Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0424898Z 2025-05-07T20:32:49.0425086Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.0425252Z 2025-05-07T20:32:49.0425401Z moe/activation_test.py:117: 2025-05-07T20:32:49.0425703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0426036Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.0426316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0427007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.0427698Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.0428232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.0428909Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.0429575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.0430108Z kernel = self.compile( 2025-05-07T20:32:49.0430651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.0431347Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.0431748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0431975Z 2025-05-07T20:32:49.0432183Z self = 2025-05-07T20:32:49.0433255Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.0434617Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c41068e0>} 2025-05-07T20:32:49.0435948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.0436961Z context = 2025-05-07T20:32:49.0437247Z 2025-05-07T20:32:49.0437415Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.0437934Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.0438400Z module_map=module_map) 2025-05-07T20:32:49.0438770Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.0439124Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.0439463Z E ^ 2025-05-07T20:32:49.0439947Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.0440393Z 2025-05-07T20:32:49.0440805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.0441319Z 2025-05-07T20:32:49.0441432Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.0441845Z self=, 2025-05-07T20:32:49.0442291Z T=1, 2025-05-07T20:32:49.0442472Z D=5120, 2025-05-07T20:32:49.0442669Z scale_ub=None, 2025-05-07T20:32:49.0442887Z contiguous=True, 2025-05-07T20:32:49.0443107Z compiled=True, 2025-05-07T20:32:49.0443311Z ) 2025-05-07T20:32:49.0443633Z self = 2025-05-07T20:32:49.0444105Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:49.0444367Z 2025-05-07T20:32:49.0444445Z @given( 2025-05-07T20:32:49.0444672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.0444980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.0445284Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.0445653Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.0445983Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.0446264Z ) 2025-05-07T20:32:49.0446610Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.0447052Z def test_silu_mul_quant( 2025-05-07T20:32:49.0447287Z self, 2025-05-07T20:32:49.0447479Z T: int, 2025-05-07T20:32:49.0447674Z D: int, 2025-05-07T20:32:49.0447884Z scale_ub: Optional[float], 2025-05-07T20:32:49.0448152Z contiguous: bool, 2025-05-07T20:32:49.0448391Z compiled: bool, 2025-05-07T20:32:49.0448605Z ) -> None: 2025-05-07T20:32:49.0448823Z torch.manual_seed(2025) 2025-05-07T20:32:49.0449060Z 2025-05-07T20:32:49.0449345Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.0449706Z 2025-05-07T20:32:49.0449899Z x_sign = torch.sign(x) 2025-05-07T20:32:49.0450192Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.0450541Z x = x_sign * x_clamp 2025-05-07T20:32:49.0450782Z x0 = x[:, :D] 2025-05-07T20:32:49.0451001Z x1 = x[:, D:] 2025-05-07T20:32:49.0451199Z 2025-05-07T20:32:49.0451381Z if contiguous: 2025-05-07T20:32:49.0451608Z x0 = x0.contiguous() 2025-05-07T20:32:49.0451856Z x1 = x1.contiguous() 2025-05-07T20:32:49.0452098Z 2025-05-07T20:32:49.0452289Z if scale_ub is not None: 2025-05-07T20:32:49.0452556Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.0452883Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.0453187Z ) 2025-05-07T20:32:49.0453371Z else: 2025-05-07T20:32:49.0453577Z scale_ub_tensor = None 2025-05-07T20:32:49.0453896Z 2025-05-07T20:32:49.0454117Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.0454429Z op = silu_mul_quant 2025-05-07T20:32:49.0454680Z if compiled: 2025-05-07T20:32:49.0454924Z op = torch.compile(op) 2025-05-07T20:32:49.0455214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0455486Z 2025-05-07T20:32:49.0455677Z y_fp8, y_scale = fn() 2025-05-07T20:32:49.0455950Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:49.0456239Z 2025-05-07T20:32:49.0456471Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.0456798Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:49.0457084Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:49.0457394Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:49.0457791Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.0458095Z 2025-05-07T20:32:49.0458295Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:49.0458484Z 2025-05-07T20:32:49.0458586Z moe/activation_test.py:126: 2025-05-07T20:32:49.0458875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0459204Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:49.0459524Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.0460335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:49.0461072Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:49.0461610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.0462278Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.0462953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:49.0463663Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:49.0464444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:49.0465067Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:49.0465664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:49.0466178Z fn() 2025-05-07T20:32:49.0466675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:49.0467241Z self.fn.run( 2025-05-07T20:32:49.0467701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.0468223Z kernel = self.compile( 2025-05-07T20:32:49.0468749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.0469399Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.0469882Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0470108Z 2025-05-07T20:32:49.0470319Z self = 2025-05-07T20:32:49.0471385Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.0472730Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c4107560>} 2025-05-07T20:32:49.0474057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.0475064Z context = 2025-05-07T20:32:49.0475345Z 2025-05-07T20:32:49.0475515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.0476020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.0476483Z module_map=module_map) 2025-05-07T20:32:49.0476847Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.0477193Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:49.0477458Z E ^ 2025-05-07T20:32:49.0477912Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same test body and ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row traceback as the T=1 example above]
E   triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( -> ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same traceback and CompilationError]

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same traceback and CompilationError]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same traceback and CompilationError]
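Every failure above is the same root cause surfaced through Hypothesis's example sweep: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype because the runner's GPU predates compute capability 8.9; pre-Ada parts only expose fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal skip-guard sketch for such environments follows; `supports_fp8e4nv` and `requires_fp8e4nv` are hypothetical names, not part of the test file:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (FP8 E4M3) needs an SM89+ NVIDIA GPU (Ada/Hopper); older
        # architectures only expose fp8e4b15 and fp8e5 in Triton, which is
        # exactly what the ValueError above reports.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )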
2025-05-07T20:32:50.5343478Z W0507 20:32:50.533000 276945 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:50.5344697Z W0507 20:32:50.533000 276945 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:50.5346105Z W0507 20:32:50.533000 276945 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:50.5347089Z W0507 20:32:50.533000 276945 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:50.5348177Z W0507 20:32:50.533000 276945 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[same test body as above; this example fails in the forward call rather than the reference path]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
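The recompile-limit warning above is a side effect of the same sweep: each sampled shape/stride combination adds a new dynamo guard for silu_mul_quant until the default limit of 8 is hit. If the recompiles themselves needed fixing, two standard knobs apply; a sketch, assuming this build's torch._dynamo API (the warning itself names config.recompile_limit), with x0/x1 standing in for the test's input tensors:

    import torch
    import torch._dynamo

    # Allow more guard specializations before torch.compile gives up:
    torch._dynamo.config.recompile_limit = 64

    # Or treat the token dimension as dynamic so the sampled T values
    # (1, 128, 2048, 4096, 16384) share one compiled graph; the stride
    # guard from the contiguous=False variants can still force one more.
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)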
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
[same test body as above; fn() returns and the reference path fails instead]
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
[same triton_quantize_fp8_row -> _kernel_quantize_fp8_row traceback as above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
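Note that the reference path fails before any comparison happens: triton_quantize_fp8_row always JIT-compiles _kernel_quantize_fp8_row, which materializes an fp8e4nv output. A standalone repro sketch under the same assumption (an fbgemm_gpu GenAI build running on a pre-SM89 GPU):

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
    # On a pre-SM89 GPU this raises the CompilationError seen above while
    # compiling _kernel_quantize_fp8_row; on SM89+ it returns the fp8 rows
    # and per-row scales.
    y_fp8, y_scale = triton_quantize_fp8_row(y)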
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[same test body as above; the eager path fails in the forward call]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[same Triton jit/compile traceback as above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[same test body and _fbgemm_silu_mul_quant traceback as above, via torch/_dynamo/eval_frame.py]
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( -> ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[same test body and _fbgemm_silu_mul_quant traceback as above]
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( -> ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Each of these examples fails in fn() with the identical CompilationError from _fbgemm_silu_mul_quant (ValueError: type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')). The compiled=True cases additionally pass through /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching the same Triton compile step.
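The eval_frame.py frame in the compiled=True cases is expected: torch.compile only wraps the Python call, while the Triton kernel inside silu_mul_quant is still JIT-compiled for the local GPU on first use, so both paths end at the same device-specific compile step. A rough sketch of that dispatch with illustrative names (my_op stands in for silu_mul_quant and is not the FBGEMM implementation):

    # Hedged sketch of the eager-vs-compiled dispatch exercised by the test.
    import torch


    def my_op(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # In FBGEMM this launches a Triton kernel; the kernel is compiled for
        # the *local* GPU the first time it runs, which is where the
        # fp8e4nv ValueError fires.
        return x0 * torch.sigmoid(x0) * x1


    def run(x0: torch.Tensor, x1: torch.Tensor, compiled: bool) -> torch.Tensor:
        op = torch.compile(my_op) if compiled else my_op
        # Either way, the device-specific kernel compile happens at this call.
        return op(x0, x1)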
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

This example gets further: fn() succeeds, and the failure moves into the reference path instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
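The reference path makes the row-wise quantization contract explicit: y_scale is a per-row dequantization factor, so y is recovered as y_fp8.to(torch.float32) * y_scale[:, None]. A plain-PyTorch approximation of what triton_quantize_fp8_row computes, inferred from the test's usage rather than from FBGEMM's kernel; the eps handling and scale_ub semantics are assumptions:

    # Hedged pure-PyTorch sketch of row-wise fp8 quantization matching the
    # test's dequant convention. Not the FBGEMM kernel.
    from typing import Optional, Tuple

    import torch


    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).float()          # per-row absolute max
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        scale = row_max.clamp(min=1e-12) / fp8_max      # dequant factor per row
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale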
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)

Each of these again fails in fn() with the identical CompilationError from _fbgemm_silu_mul_quant; the compiled=True cases again route through torch/_dynamo/eval_frame.py first.
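For completeness, the architecture check can be reproduced outside FBGEMM: any Triton kernel that materializes the fp8e4nv type should fail the same way on a pre-SM89 GPU. The following is an assumed minimal repro, not taken from this log, using type names from recent Triton releases:

    # Hedged minimal repro (assumption): casting to fp8e4nv inside a Triton
    # kernel raises the same ValueError at compile time on sm < 89.
    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr):
        x = tl.load(x_ptr)
        tl.store(y_ptr, x.to(tl.float8e4nv))  # compile-time failure on old GPUs


    x = torch.ones(1, device="cuda", dtype=torch.float32)
    y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y)  # raises CompilationError on e.g. sm_86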
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.9672222Z 2025-05-07T20:32:51.9672641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.9673248Z 2025-05-07T20:32:51.9673352Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.9673761Z self=, 2025-05-07T20:32:51.9674160Z T=4096, 2025-05-07T20:32:51.9674352Z D=7168, 2025-05-07T20:32:51.9674549Z scale_ub=1200.0, 2025-05-07T20:32:51.9674767Z contiguous=False, 2025-05-07T20:32:51.9674997Z compiled=False, 2025-05-07T20:32:51.9675205Z ) 2025-05-07T20:32:51.9675516Z self = 2025-05-07T20:32:51.9676052Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.9676325Z 2025-05-07T20:32:51.9676409Z @given( 2025-05-07T20:32:51.9676638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.9676946Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.9677246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.9677575Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.9677901Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.9678185Z ) 2025-05-07T20:32:51.9678528Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.9678959Z def test_silu_mul_quant( 2025-05-07T20:32:51.9679265Z self, 2025-05-07T20:32:51.9679468Z T: int, 2025-05-07T20:32:51.9679672Z D: int, 2025-05-07T20:32:51.9679903Z scale_ub: Optional[float], 2025-05-07T20:32:51.9680199Z contiguous: bool, 2025-05-07T20:32:51.9680452Z compiled: bool, 2025-05-07T20:32:51.9680688Z ) -> None: 2025-05-07T20:32:51.9680910Z torch.manual_seed(2025) 2025-05-07T20:32:51.9681172Z 2025-05-07T20:32:51.9681462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.9681844Z 2025-05-07T20:32:51.9682045Z x_sign = torch.sign(x) 2025-05-07T20:32:51.9682360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.9682704Z x = x_sign * x_clamp 2025-05-07T20:32:51.9682965Z x0 = x[:, :D] 2025-05-07T20:32:51.9683187Z x1 = x[:, D:] 2025-05-07T20:32:51.9683410Z 2025-05-07T20:32:51.9683599Z if contiguous: 2025-05-07T20:32:51.9683851Z x0 = x0.contiguous() 2025-05-07T20:32:51.9684134Z x1 = x1.contiguous() 2025-05-07T20:32:51.9684441Z 2025-05-07T20:32:51.9684644Z if scale_ub is not None: 2025-05-07T20:32:51.9684938Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.9685310Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.9685649Z ) 2025-05-07T20:32:51.9685846Z else: 2025-05-07T20:32:51.9686065Z scale_ub_tensor = None 2025-05-07T20:32:51.9686336Z 2025-05-07T20:32:51.9686575Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.9686919Z op = silu_mul_quant 2025-05-07T20:32:51.9687186Z if compiled: 2025-05-07T20:32:51.9687450Z op = torch.compile(op) 2025-05-07T20:32:51.9687772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.9688078Z 2025-05-07T20:32:51.9688273Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.9688453Z 2025-05-07T20:32:51.9688557Z moe/activation_test.py:117: 2025-05-07T20:32:51.9688886Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.9689255Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.9689561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.9690375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:51.9691199Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.9691822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.9692623Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.9693459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.9694101Z kernel = self.compile( 2025-05-07T20:32:51.9694635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.9695288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.9695684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.9695955Z 2025-05-07T20:32:51.9696165Z self = 2025-05-07T20:32:51.9697222Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.9698836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a426480>} 2025-05-07T20:32:51.9700298Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.9701302Z context = 2025-05-07T20:32:51.9701582Z 2025-05-07T20:32:51.9701752Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.9702258Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.9702717Z module_map=module_map) 2025-05-07T20:32:51.9703078Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.9703425Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.9703681Z E ^ 2025-05-07T20:32:51.9704140Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.9704577Z 2025-05-07T20:32:51.9704994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.1249347Z 2025-05-07T20:32:52.1249939Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.1250853Z self=, 2025-05-07T20:32:52.1251660Z T=16384, 2025-05-07T20:32:52.1252051Z D=7168, 2025-05-07T20:32:52.1252432Z scale_ub=None, 2025-05-07T20:32:52.1252858Z contiguous=True, 2025-05-07T20:32:52.1253303Z compiled=True, 2025-05-07T20:32:52.1253823Z ) 2025-05-07T20:32:52.1254449Z self = 2025-05-07T20:32:52.1255415Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.1255959Z 2025-05-07T20:32:52.1256123Z @given( 2025-05-07T20:32:52.1256573Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.1257195Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.1257796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.1258449Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.1259104Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.1259616Z ) 2025-05-07T20:32:52.1260007Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.1260455Z def test_silu_mul_quant( 2025-05-07T20:32:52.1260699Z self, 2025-05-07T20:32:52.1260897Z T: int, 2025-05-07T20:32:52.1261093Z D: int, 2025-05-07T20:32:52.1261314Z scale_ub: Optional[float], 2025-05-07T20:32:52.1261586Z contiguous: bool, 2025-05-07T20:32:52.1261825Z compiled: bool, 2025-05-07T20:32:52.1262052Z ) -> None: 2025-05-07T20:32:52.1262343Z torch.manual_seed(2025) 2025-05-07T20:32:52.1262583Z 2025-05-07T20:32:52.1262857Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.1263201Z 2025-05-07T20:32:52.1263393Z x_sign = torch.sign(x) 2025-05-07T20:32:52.1263686Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.1264001Z x = x_sign * x_clamp 2025-05-07T20:32:52.1264238Z x0 = x[:, :D] 2025-05-07T20:32:52.1264457Z x1 = x[:, D:] 2025-05-07T20:32:52.1264672Z 2025-05-07T20:32:52.1264928Z if contiguous: 2025-05-07T20:32:52.1265161Z x0 = x0.contiguous() 2025-05-07T20:32:52.1265417Z x1 = x1.contiguous() 2025-05-07T20:32:52.1265657Z 2025-05-07T20:32:52.1265845Z if scale_ub is not None: 2025-05-07T20:32:52.1266121Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.1266455Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.1266757Z ) 2025-05-07T20:32:52.1266958Z else: 2025-05-07T20:32:52.1267173Z scale_ub_tensor = None 2025-05-07T20:32:52.1267423Z 2025-05-07T20:32:52.1267651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.1267966Z op = silu_mul_quant 2025-05-07T20:32:52.1268304Z if compiled: 2025-05-07T20:32:52.1268560Z op = torch.compile(op) 2025-05-07T20:32:52.1268857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1269128Z 2025-05-07T20:32:52.1269329Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.1269494Z 2025-05-07T20:32:52.1269603Z moe/activation_test.py:117: 2025-05-07T20:32:52.1269925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1270275Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.1270559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1271116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.1271670Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.1272320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.1273005Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.1273588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.1274255Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.1274912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.1275441Z kernel = self.compile( 2025-05-07T20:32:52.1275971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.1276621Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1277018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1277244Z 2025-05-07T20:32:52.1277452Z self = 2025-05-07T20:32:52.1278514Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.1279887Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a427740>} 2025-05-07T20:32:52.1281233Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.1282234Z context = 2025-05-07T20:32:52.1282563Z 2025-05-07T20:32:52.1282732Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.1283241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1283708Z module_map=module_map) 2025-05-07T20:32:52.1284074Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1284423Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1284687Z E ^ 2025-05-07T20:32:52.1285190Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1285628Z 2025-05-07T20:32:52.1286040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.1286545Z 2025-05-07T20:32:52.1286650Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.1287057Z self=, 2025-05-07T20:32:52.1287458Z T=4096, 2025-05-07T20:32:52.1287645Z D=5120, 2025-05-07T20:32:52.1287845Z scale_ub=None, 2025-05-07T20:32:52.1288064Z contiguous=False, 2025-05-07T20:32:52.1288289Z compiled=True, 2025-05-07T20:32:52.1288536Z ) 2025-05-07T20:32:52.1288856Z self = 2025-05-07T20:32:52.1289343Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.1289618Z 2025-05-07T20:32:52.1289736Z @given( 2025-05-07T20:32:52.1290006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.1290328Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.1290634Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.1290956Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.1291282Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.1291581Z ) 2025-05-07T20:32:52.1291927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.1292366Z def test_silu_mul_quant( 2025-05-07T20:32:52.1292601Z self, 2025-05-07T20:32:52.1292794Z T: int, 2025-05-07T20:32:52.1292991Z D: int, 2025-05-07T20:32:52.1293207Z scale_ub: Optional[float], 2025-05-07T20:32:52.1293522Z contiguous: bool, 2025-05-07T20:32:52.1293843Z compiled: bool, 2025-05-07T20:32:52.1294061Z ) -> None: 2025-05-07T20:32:52.1294280Z torch.manual_seed(2025) 2025-05-07T20:32:52.1294518Z 2025-05-07T20:32:52.1294782Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.1295127Z 2025-05-07T20:32:52.1295320Z x_sign = torch.sign(x) 2025-05-07T20:32:52.1295602Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.1295910Z x = x_sign * x_clamp 2025-05-07T20:32:52.1296148Z x0 = x[:, :D] 2025-05-07T20:32:52.1296366Z x1 = x[:, D:] 2025-05-07T20:32:52.1296566Z 2025-05-07T20:32:52.1296753Z if contiguous: 2025-05-07T20:32:52.1296981Z x0 = x0.contiguous() 2025-05-07T20:32:52.1297232Z x1 = x1.contiguous() 2025-05-07T20:32:52.1297470Z 2025-05-07T20:32:52.1297667Z if scale_ub is not None: 2025-05-07T20:32:52.1297935Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.1298429Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.1298738Z ) 2025-05-07T20:32:52.1298929Z else: 2025-05-07T20:32:52.1299138Z scale_ub_tensor = None 2025-05-07T20:32:52.1299390Z 2025-05-07T20:32:52.1299616Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.1299936Z op = silu_mul_quant 2025-05-07T20:32:52.1300182Z if compiled: 2025-05-07T20:32:52.1300421Z op = torch.compile(op) 2025-05-07T20:32:52.1300719Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1301072Z 2025-05-07T20:32:52.1301264Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.1301425Z 2025-05-07T20:32:52.1301522Z moe/activation_test.py:117: 2025-05-07T20:32:52.1301812Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1302184Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.1302460Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1303013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.1303632Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.1304276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.1304951Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.1305480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.1306158Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.1306813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.1307345Z kernel = self.compile( 2025-05-07T20:32:52.1307945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.1308595Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1308989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1309219Z 2025-05-07T20:32:52.1309422Z self = 2025-05-07T20:32:52.1310485Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.1311836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92aeacc20>} 2025-05-07T20:32:52.1313215Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.1314224Z context = 2025-05-07T20:32:52.1314513Z 2025-05-07T20:32:52.1314678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.1315192Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1315649Z module_map=module_map) 2025-05-07T20:32:52.1316017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1316380Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1316642Z E ^ 2025-05-07T20:32:52.1317100Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1317545Z 2025-05-07T20:32:52.1317963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2683697Z 2025-05-07T20:32:52.2683955Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2684967Z self=, 2025-05-07T20:32:52.2685716Z T=4096, 2025-05-07T20:32:52.2686115Z D=5120, 2025-05-07T20:32:52.2686470Z scale_ub=1200.0, 2025-05-07T20:32:52.2686961Z contiguous=False, 2025-05-07T20:32:52.2687548Z compiled=False, 2025-05-07T20:32:52.2687962Z ) 2025-05-07T20:32:52.2688534Z self = 2025-05-07T20:32:52.2689419Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.2690134Z 2025-05-07T20:32:52.2690275Z @given( 2025-05-07T20:32:52.2690565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2690873Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2691183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2691509Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2691831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2692108Z ) 2025-05-07T20:32:52.2692521Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2692952Z def test_silu_mul_quant( 2025-05-07T20:32:52.2693186Z self, 2025-05-07T20:32:52.2693376Z T: int, 2025-05-07T20:32:52.2693571Z D: int, 2025-05-07T20:32:52.2693897Z scale_ub: Optional[float], 2025-05-07T20:32:52.2694164Z contiguous: bool, 2025-05-07T20:32:52.2694402Z compiled: bool, 2025-05-07T20:32:52.2694623Z ) -> None: 2025-05-07T20:32:52.2694833Z torch.manual_seed(2025) 2025-05-07T20:32:52.2695069Z 2025-05-07T20:32:52.2695334Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2695669Z 2025-05-07T20:32:52.2695929Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2696217Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2696521Z x = x_sign * x_clamp 2025-05-07T20:32:52.2696765Z x0 = x[:, :D] 2025-05-07T20:32:52.2696981Z x1 = x[:, D:] 2025-05-07T20:32:52.2697182Z 2025-05-07T20:32:52.2697369Z if contiguous: 2025-05-07T20:32:52.2697598Z x0 = x0.contiguous() 2025-05-07T20:32:52.2697850Z x1 = x1.contiguous() 2025-05-07T20:32:52.2698091Z 2025-05-07T20:32:52.2698456Z if scale_ub is not None: 2025-05-07T20:32:52.2698722Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2699053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2699365Z ) 2025-05-07T20:32:52.2699562Z else: 2025-05-07T20:32:52.2699806Z scale_ub_tensor = None 2025-05-07T20:32:52.2700065Z 2025-05-07T20:32:52.2700289Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2700605Z op = silu_mul_quant 2025-05-07T20:32:52.2700923Z if compiled: 2025-05-07T20:32:52.2701168Z op = torch.compile(op) 2025-05-07T20:32:52.2701462Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2701736Z 2025-05-07T20:32:52.2701931Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.2702092Z 2025-05-07T20:32:52.2702189Z moe/activation_test.py:117: 2025-05-07T20:32:52.2702480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2702814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.2703086Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2703768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.2704447Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.2704977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2705653Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2706307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2706835Z kernel = self.compile( 2025-05-07T20:32:52.2707363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2708011Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2708406Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2708629Z 2025-05-07T20:32:52.2708909Z self = 2025-05-07T20:32:52.2709968Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2711318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92aead6c0>} 2025-05-07T20:32:52.2712728Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2713729Z context = 2025-05-07T20:32:52.2714007Z 2025-05-07T20:32:52.2714178Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2714690Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2715150Z module_map=module_map) 2025-05-07T20:32:52.2715514Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2715918Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.2716179Z E ^ 2025-05-07T20:32:52.2716634Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2717073Z 2025-05-07T20:32:52.2717483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2718010Z 2025-05-07T20:32:52.2718114Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2718516Z self=, 2025-05-07T20:32:52.2718911Z T=4096, 2025-05-07T20:32:52.2719100Z D=5120, 2025-05-07T20:32:52.2719289Z scale_ub=1200.0, 2025-05-07T20:32:52.2719522Z contiguous=False, 2025-05-07T20:32:52.2719786Z compiled=True, 2025-05-07T20:32:52.2719989Z ) 2025-05-07T20:32:52.2720311Z self = 2025-05-07T20:32:52.2720867Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:52.2721148Z 2025-05-07T20:32:52.2721237Z @given( 2025-05-07T20:32:52.2721471Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2721800Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2722109Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2722434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2722763Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2723051Z ) 2025-05-07T20:32:52.2723395Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2723843Z def test_silu_mul_quant( 2025-05-07T20:32:52.2724085Z self, 2025-05-07T20:32:52.2724277Z T: int, 2025-05-07T20:32:52.2724478Z D: int, 2025-05-07T20:32:52.2724701Z scale_ub: Optional[float], 2025-05-07T20:32:52.2724969Z contiguous: bool, 2025-05-07T20:32:52.2725214Z compiled: bool, 2025-05-07T20:32:52.2725441Z ) -> None: 2025-05-07T20:32:52.2725659Z torch.manual_seed(2025) 2025-05-07T20:32:52.2725896Z 2025-05-07T20:32:52.2726169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2726514Z 2025-05-07T20:32:52.2726705Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2726999Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2727309Z x = x_sign * x_clamp 2025-05-07T20:32:52.2727545Z x0 = x[:, :D] 2025-05-07T20:32:52.2727767Z x1 = x[:, D:] 2025-05-07T20:32:52.2727980Z 2025-05-07T20:32:52.2728163Z if contiguous: 2025-05-07T20:32:52.2728446Z x0 = x0.contiguous() 2025-05-07T20:32:52.2728716Z x1 = x1.contiguous() 2025-05-07T20:32:52.2728951Z 2025-05-07T20:32:52.2729150Z if scale_ub is not None: 2025-05-07T20:32:52.2734892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2735331Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2735646Z ) 2025-05-07T20:32:52.2735842Z else: 2025-05-07T20:32:52.2736045Z scale_ub_tensor = None 2025-05-07T20:32:52.2736384Z 2025-05-07T20:32:52.2736618Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2736930Z op = silu_mul_quant 2025-05-07T20:32:52.2737181Z if compiled: 2025-05-07T20:32:52.2737426Z op = torch.compile(op) 2025-05-07T20:32:52.2737722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2737994Z 2025-05-07T20:32:52.2738183Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.2738349Z 2025-05-07T20:32:52.2738450Z moe/activation_test.py:117: 2025-05-07T20:32:52.2738745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2739073Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.2739354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2740012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.2740567Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.2741213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.2741893Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.2742416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2743083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2743738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2744261Z kernel = self.compile( 2025-05-07T20:32:52.2744790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2745486Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2745890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2746115Z 2025-05-07T20:32:52.2746323Z self = 2025-05-07T20:32:52.2747387Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2748733Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92aeaefc0>} 2025-05-07T20:32:52.2750067Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2751075Z context = 2025-05-07T20:32:52.2751358Z 2025-05-07T20:32:52.2751522Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2752037Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2752496Z module_map=module_map) 2025-05-07T20:32:52.2752860Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2753204Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.2753464Z E ^ 2025-05-07T20:32:52.2753918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2754402Z 2025-05-07T20:32:52.2754812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2755319Z 2025-05-07T20:32:52.2755430Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2755842Z self=, 2025-05-07T20:32:52.2756242Z T=2048, 2025-05-07T20:32:52.2756477Z D=7168, 2025-05-07T20:32:52.2756668Z scale_ub=1200.0, 2025-05-07T20:32:52.2756892Z contiguous=False, 2025-05-07T20:32:52.2757114Z compiled=False, 2025-05-07T20:32:52.4685575Z ) 2025-05-07T20:32:52.4686312Z self = 2025-05-07T20:32:52.4687469Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.4688248Z 2025-05-07T20:32:52.4688439Z @given( 2025-05-07T20:32:52.4689088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.4689756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.4690193Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.4690530Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.4690970Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.4691271Z ) 2025-05-07T20:32:52.4691627Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.4692071Z def test_silu_mul_quant( 2025-05-07T20:32:52.4692315Z self, 2025-05-07T20:32:52.4692516Z T: int, 2025-05-07T20:32:52.4692709Z D: int, 2025-05-07T20:32:52.4692934Z scale_ub: Optional[float], 2025-05-07T20:32:52.4693213Z contiguous: bool, 2025-05-07T20:32:52.4693451Z compiled: bool, 2025-05-07T20:32:52.4693758Z ) -> None: 2025-05-07T20:32:52.4693981Z torch.manual_seed(2025) 2025-05-07T20:32:52.4694231Z 2025-05-07T20:32:52.4694504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.4694852Z 2025-05-07T20:32:52.4695050Z x_sign = torch.sign(x) 2025-05-07T20:32:52.4695345Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.4695671Z x = x_sign * x_clamp 2025-05-07T20:32:52.4695997Z x0 = x[:, :D] 2025-05-07T20:32:52.4696221Z x1 = x[:, D:] 2025-05-07T20:32:52.4696434Z 2025-05-07T20:32:52.4696627Z if contiguous: 2025-05-07T20:32:52.4696862Z x0 = x0.contiguous() 2025-05-07T20:32:52.4697123Z x1 = x1.contiguous() 2025-05-07T20:32:52.4697368Z 2025-05-07T20:32:52.4697565Z if scale_ub is not None: 2025-05-07T20:32:52.4697841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.4698360Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.4698671Z ) 2025-05-07T20:32:52.4698872Z else: 2025-05-07T20:32:52.4699091Z scale_ub_tensor = None 2025-05-07T20:32:52.4699345Z 2025-05-07T20:32:52.4699576Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.4699897Z op = silu_mul_quant 2025-05-07T20:32:52.4700152Z if compiled: 2025-05-07T20:32:52.4700405Z op = torch.compile(op) 2025-05-07T20:32:52.4700713Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.4700992Z 2025-05-07T20:32:52.4701187Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.4701390Z 2025-05-07T20:32:52.4701496Z moe/activation_test.py:117: 2025-05-07T20:32:52.4701796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.4702132Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.4702417Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.4703104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.4703863Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.4704407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.4705090Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.4705755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.4706291Z kernel = self.compile( 2025-05-07T20:32:52.4706841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.4707567Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.4707976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.4708207Z 2025-05-07T20:32:52.4708415Z self = 2025-05-07T20:32:52.4709486Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.4710954Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92aeafec0>} 2025-05-07T20:32:52.4712286Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.4713300Z context = 2025-05-07T20:32:52.4713587Z 2025-05-07T20:32:52.4713760Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.4714280Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.4714749Z module_map=module_map) 2025-05-07T20:32:52.4715124Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.4715483Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.4715742Z E ^ 2025-05-07T20:32:52.4716265Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.4716716Z 2025-05-07T20:32:52.4717130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.4717640Z 2025-05-07T20:32:52.4717750Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.4718167Z self=, 2025-05-07T20:32:52.4718570Z T=1, 2025-05-07T20:32:52.4718761Z D=7168, 2025-05-07T20:32:52.4718954Z scale_ub=None, 2025-05-07T20:32:52.4719174Z contiguous=True, 2025-05-07T20:32:52.4719403Z compiled=False, 2025-05-07T20:32:52.4719608Z ) 2025-05-07T20:32:52.4719972Z self = 2025-05-07T20:32:52.4720464Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:52.4720727Z 2025-05-07T20:32:52.4720819Z @given( 2025-05-07T20:32:52.4721052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.4721369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.4721680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.4722010Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.4722342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.4722630Z ) 2025-05-07T20:32:52.4722978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.4723421Z def test_silu_mul_quant( 2025-05-07T20:32:52.4723662Z self, 2025-05-07T20:32:52.4723855Z T: int, 2025-05-07T20:32:52.4724135Z D: int, 2025-05-07T20:32:52.4724356Z scale_ub: Optional[float], 2025-05-07T20:32:52.4724626Z contiguous: bool, 2025-05-07T20:32:52.4724863Z compiled: bool, 2025-05-07T20:32:52.4725089Z ) -> None: 2025-05-07T20:32:52.4725309Z torch.manual_seed(2025) 2025-05-07T20:32:52.4725548Z 2025-05-07T20:32:52.4725828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.4726173Z 2025-05-07T20:32:52.4726368Z x_sign = torch.sign(x) 2025-05-07T20:32:52.4726708Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.4727018Z x = x_sign * x_clamp 2025-05-07T20:32:52.4727260Z x0 = x[:, :D] 2025-05-07T20:32:52.4727484Z x1 = x[:, D:] 2025-05-07T20:32:52.4727693Z 2025-05-07T20:32:52.4727879Z if contiguous: 2025-05-07T20:32:52.4728116Z x0 = x0.contiguous() 2025-05-07T20:32:52.4728378Z x1 = x1.contiguous() 2025-05-07T20:32:52.4728621Z 2025-05-07T20:32:52.4728817Z if scale_ub is not None: 2025-05-07T20:32:52.4729095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.4729428Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.4729742Z ) 2025-05-07T20:32:52.4729981Z else: 2025-05-07T20:32:52.4730238Z scale_ub_tensor = None 2025-05-07T20:32:52.4730507Z 2025-05-07T20:32:52.4730742Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.4731062Z op = silu_mul_quant 2025-05-07T20:32:52.4731314Z if compiled: 2025-05-07T20:32:52.4731562Z op = torch.compile(op) 2025-05-07T20:32:52.4731870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.4732142Z 2025-05-07T20:32:52.4732338Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.4732503Z 2025-05-07T20:32:52.4732605Z moe/activation_test.py:117: 2025-05-07T20:32:52.4732901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.4733238Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.4733523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.4734298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.4735024Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.4735562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.4736242Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.4736905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.4737436Z kernel = self.compile( 2025-05-07T20:32:52.4737980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.4738638Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.4739035Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.4739270Z 2025-05-07T20:32:52.4739482Z self = 2025-05-07T20:32:52.4740607Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.4741964Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92b26ccc0>} 2025-05-07T20:32:52.4743290Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.4744349Z context = 2025-05-07T20:32:52.4744639Z 2025-05-07T20:32:52.4744806Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.4745326Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.4745798Z module_map=module_map) 2025-05-07T20:32:52.4746161Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.4746559Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.4746825Z E ^ 2025-05-07T20:32:52.4747283Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.4747732Z 2025-05-07T20:32:52.4748144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.4748657Z 2025-05-07T20:32:52.4748767Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.4749180Z self=, 2025-05-07T20:32:52.4749583Z T=16384, 2025-05-07T20:32:52.4749778Z D=7168, 2025-05-07T20:32:52.4749973Z scale_ub=1200.0, 2025-05-07T20:32:52.4750243Z contiguous=False, 2025-05-07T20:32:52.4750473Z compiled=True, 2025-05-07T20:32:52.4750680Z ) 2025-05-07T20:32:52.4750997Z self = 2025-05-07T20:32:52.4751493Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:52.4751771Z 2025-05-07T20:32:52.4751851Z @given( 2025-05-07T20:32:52.4752078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.4752389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.4752691Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.4753018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.4753343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.4753629Z ) 2025-05-07T20:32:52.4753975Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.4754414Z def test_silu_mul_quant( 2025-05-07T20:32:52.4754657Z self, 2025-05-07T20:32:52.4754857Z T: int, 2025-05-07T20:32:52.4755051Z D: int, 2025-05-07T20:32:52.4755316Z scale_ub: Optional[float], 2025-05-07T20:32:52.4755586Z contiguous: bool, 2025-05-07T20:32:52.4755817Z compiled: bool, 2025-05-07T20:32:52.4756043Z ) -> None: 2025-05-07T20:32:52.4756260Z torch.manual_seed(2025) 2025-05-07T20:32:52.4756495Z 2025-05-07T20:32:52.4756769Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.4757109Z 2025-05-07T20:32:52.4757303Z x_sign = torch.sign(x) 2025-05-07T20:32:52.4757586Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.4757892Z x = x_sign * x_clamp 2025-05-07T20:32:52.4758137Z x0 = x[:, :D] 2025-05-07T20:32:52.4758349Z x1 = x[:, D:] 2025-05-07T20:32:52.4758555Z 2025-05-07T20:32:52.4758741Z if contiguous: 2025-05-07T20:32:52.4758964Z x0 = x0.contiguous() 2025-05-07T20:32:52.4759225Z x1 = x1.contiguous() 2025-05-07T20:32:52.4759469Z 2025-05-07T20:32:52.4759660Z if scale_ub is not None: 2025-05-07T20:32:52.4759937Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.4760268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.4760581Z ) 2025-05-07T20:32:52.4760780Z else: 2025-05-07T20:32:52.4760989Z scale_ub_tensor = None 2025-05-07T20:32:52.4761234Z 2025-05-07T20:32:52.4761466Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.4761778Z op = silu_mul_quant 2025-05-07T20:32:52.4762027Z if compiled: 2025-05-07T20:32:52.4762268Z op = torch.compile(op) 2025-05-07T20:32:52.4762621Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.4762897Z 2025-05-07T20:32:52.4763086Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.4763249Z 2025-05-07T20:32:52.4763348Z moe/activation_test.py:117: 2025-05-07T20:32:52.4763644Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.4763973Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.4764247Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.4764803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.4765396Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.4766049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.4766723Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.4767256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.4767930Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.4768627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.4769155Z kernel = self.compile( 2025-05-07T20:32:52.4769696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.4770394Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.4770791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.4771016Z 2025-05-07T20:32:52.4771225Z self = 2025-05-07T20:32:52.4772277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.4773628Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92b26e0c0>} 2025-05-07T20:32:52.4775094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.4776103Z context = 2025-05-07T20:32:52.4776385Z 2025-05-07T20:32:52.4776556Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.4777067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.4777530Z module_map=module_map) 2025-05-07T20:32:52.4777898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.4778251Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.4778508Z E ^ 2025-05-07T20:32:52.4778962Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.4779408Z 2025-05-07T20:32:52.4779824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.6097861Z 2025-05-07T20:32:52.6098762Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.6099587Z self=, 2025-05-07T20:32:52.6100029Z T=1, 2025-05-07T20:32:52.6100210Z D=7168, 2025-05-07T20:32:52.6100399Z scale_ub=None, 2025-05-07T20:32:52.6100606Z contiguous=False, 2025-05-07T20:32:52.6100828Z compiled=False, 2025-05-07T20:32:52.6101027Z ) 2025-05-07T20:32:52.6101332Z self = 2025-05-07T20:32:52.6101974Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:52.6102234Z 2025-05-07T20:32:52.6102312Z @given( 2025-05-07T20:32:52.6102538Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.6102849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.6103152Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.6103473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.6103788Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.6104142Z ) 2025-05-07T20:32:52.6104480Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.6104904Z def test_silu_mul_quant( 2025-05-07T20:32:52.6105138Z self, 2025-05-07T20:32:52.6105327Z T: int, 2025-05-07T20:32:52.6105514Z D: int, 2025-05-07T20:32:52.6105724Z scale_ub: Optional[float], 2025-05-07T20:32:52.6105992Z contiguous: bool, 2025-05-07T20:32:52.6106225Z compiled: bool, 2025-05-07T20:32:52.6106438Z ) -> None: 2025-05-07T20:32:52.6106646Z torch.manual_seed(2025) 2025-05-07T20:32:52.6106883Z 2025-05-07T20:32:52.6107213Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.6107553Z 2025-05-07T20:32:52.6107744Z x_sign = torch.sign(x) 2025-05-07T20:32:52.6108023Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.6108324Z x = x_sign * x_clamp 2025-05-07T20:32:52.6108565Z x0 = x[:, :D] 2025-05-07T20:32:52.6108770Z x1 = x[:, D:] 2025-05-07T20:32:52.6108970Z 2025-05-07T20:32:52.6109148Z if contiguous: 2025-05-07T20:32:52.6109365Z x0 = x0.contiguous() 2025-05-07T20:32:52.6109615Z x1 = x1.contiguous() 2025-05-07T20:32:52.6109847Z 2025-05-07T20:32:52.6110028Z if scale_ub is not None: 2025-05-07T20:32:52.6110296Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.6110624Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.6110925Z ) 2025-05-07T20:32:52.6111110Z else: 2025-05-07T20:32:52.6111313Z scale_ub_tensor = None 2025-05-07T20:32:52.6111562Z 2025-05-07T20:32:52.6111784Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.6112159Z op = silu_mul_quant 2025-05-07T20:32:52.6112406Z if compiled: 2025-05-07T20:32:52.6112649Z op = torch.compile(op) 2025-05-07T20:32:52.6112950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6113224Z 2025-05-07T20:32:52.6113412Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.6113578Z 2025-05-07T20:32:52.6113677Z moe/activation_test.py:117: 2025-05-07T20:32:52.6113965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6114318Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.6114598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6115281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.6115961Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.6116496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.6117161Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.6117813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.6118333Z kernel = self.compile( 2025-05-07T20:32:52.6118863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.6119501Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.6119890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6120163Z 2025-05-07T20:32:52.6120371Z self = 2025-05-07T20:32:52.6121435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.6122788Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92b26ec00>} 2025-05-07T20:32:52.6124148Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.6125146Z context = 2025-05-07T20:32:52.6125428Z 2025-05-07T20:32:52.6125596Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.6126099Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.6126559Z module_map=module_map) 2025-05-07T20:32:52.6126969Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.6127332Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.6127590Z E ^ 2025-05-07T20:32:52.6134045Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.6134538Z 2025-05-07T20:32:52.6134965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.6135473Z 2025-05-07T20:32:52.6135576Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.6135986Z self=, 2025-05-07T20:32:52.6136386Z T=2048, 2025-05-07T20:32:52.6136576Z D=7168, 2025-05-07T20:32:52.6136759Z scale_ub=None, 2025-05-07T20:32:52.6136971Z contiguous=False, 2025-05-07T20:32:52.6137192Z compiled=True, 2025-05-07T20:32:52.6137386Z ) 2025-05-07T20:32:52.6137701Z self = 2025-05-07T20:32:52.6138265Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.6138532Z 2025-05-07T20:32:52.6138608Z @given( 2025-05-07T20:32:52.6138844Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.6139154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.6139448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.6139772Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.6140093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.6140375Z ) 2025-05-07T20:32:52.6140714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.6141158Z def test_silu_mul_quant( 2025-05-07T20:32:52.6141397Z self, 2025-05-07T20:32:52.6141585Z T: int, 2025-05-07T20:32:52.6141779Z D: int, 2025-05-07T20:32:52.6142001Z scale_ub: Optional[float], 2025-05-07T20:32:52.6142266Z contiguous: bool, 2025-05-07T20:32:52.6142503Z compiled: bool, 2025-05-07T20:32:52.6142723Z ) -> None: 2025-05-07T20:32:52.6142932Z torch.manual_seed(2025) 2025-05-07T20:32:52.6143175Z 2025-05-07T20:32:52.6143446Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.6143773Z 2025-05-07T20:32:52.6143961Z x_sign = torch.sign(x) 2025-05-07T20:32:52.6144249Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.6144551Z x = x_sign * x_clamp 2025-05-07T20:32:52.6144788Z x0 = x[:, :D] 2025-05-07T20:32:52.6145002Z x1 = x[:, D:] 2025-05-07T20:32:52.6145264Z 2025-05-07T20:32:52.6145443Z if contiguous: 2025-05-07T20:32:52.6145674Z x0 = x0.contiguous() 2025-05-07T20:32:52.6145926Z x1 = x1.contiguous() 2025-05-07T20:32:52.6146160Z 2025-05-07T20:32:52.6146349Z if scale_ub is not None: 2025-05-07T20:32:52.6146619Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.6146947Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.6147252Z ) 2025-05-07T20:32:52.6147435Z else: 2025-05-07T20:32:52.6147696Z scale_ub_tensor = None 2025-05-07T20:32:52.6147944Z 2025-05-07T20:32:52.6148166Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.6148476Z op = silu_mul_quant 2025-05-07T20:32:52.6148722Z if compiled: 2025-05-07T20:32:52.6148967Z op = torch.compile(op) 2025-05-07T20:32:52.6149254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6149526Z 2025-05-07T20:32:52.6149716Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.6149877Z 2025-05-07T20:32:52.6149974Z moe/activation_test.py:117: 2025-05-07T20:32:52.6150269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6150600Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.6150926Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6151489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.6152046Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.6152694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.6153368Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.6153892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.6154559Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.6155209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.6155733Z kernel = self.compile( 2025-05-07T20:32:52.6156268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.6156960Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.6157355Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6157585Z 2025-05-07T20:32:52.6157787Z self = 2025-05-07T20:32:52.6158847Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.6160200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a6002c0>} 2025-05-07T20:32:52.6161524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.6162523Z context = 2025-05-07T20:32:52.6162809Z 2025-05-07T20:32:52.6162973Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.6163483Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.6163936Z module_map=module_map) 2025-05-07T20:32:52.6164298Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.6164647Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.6164951Z E ^ 2025-05-07T20:32:52.6165401Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.6165845Z 2025-05-07T20:32:52.6166257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.6166762Z 2025-05-07T20:32:52.6166870Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.6167274Z self=, 2025-05-07T20:32:52.6167740Z T=4096, 2025-05-07T20:32:52.6167926Z D=7168, 2025-05-07T20:32:52.6168115Z scale_ub=None, 2025-05-07T20:32:52.6168325Z contiguous=False, 2025-05-07T20:32:52.6168549Z compiled=True, 2025-05-07T20:32:53.0265289Z ) 2025-05-07T20:32:53.0265626Z self = 2025-05-07T20:32:53.0266156Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:53.0266451Z 2025-05-07T20:32:53.0266546Z @given( 2025-05-07T20:32:53.0266797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.0267112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.0267421Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.0267866Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.0268199Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.0268491Z ) 2025-05-07T20:32:53.0268841Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.0269293Z def test_silu_mul_quant( 2025-05-07T20:32:53.0269534Z self, 2025-05-07T20:32:53.0269733Z T: int, 2025-05-07T20:32:53.0269963Z D: int, 2025-05-07T20:32:53.0270204Z scale_ub: Optional[float], 2025-05-07T20:32:53.0270473Z contiguous: bool, 2025-05-07T20:32:53.0270712Z compiled: bool, 2025-05-07T20:32:53.0270936Z ) -> None: 2025-05-07T20:32:53.0271157Z torch.manual_seed(2025) 2025-05-07T20:32:53.0271400Z 2025-05-07T20:32:53.0271671Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.0272017Z 2025-05-07T20:32:53.0272214Z x_sign = torch.sign(x) 2025-05-07T20:32:53.0272505Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.0272887Z x = x_sign * x_clamp 2025-05-07T20:32:53.0273132Z x0 = x[:, :D] 2025-05-07T20:32:53.0273347Z x1 = x[:, D:] 2025-05-07T20:32:53.0273557Z 2025-05-07T20:32:53.0273743Z if contiguous: 2025-05-07T20:32:53.0273968Z x0 = x0.contiguous() 2025-05-07T20:32:53.0274229Z x1 = x1.contiguous() 2025-05-07T20:32:53.0274473Z 2025-05-07T20:32:53.0274664Z if scale_ub is not None: 2025-05-07T20:32:53.0274981Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.0275312Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.0275625Z ) 2025-05-07T20:32:53.0275821Z else: 2025-05-07T20:32:53.0276027Z scale_ub_tensor = None 2025-05-07T20:32:53.0276280Z 2025-05-07T20:32:53.0276514Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.0276830Z op = silu_mul_quant 2025-05-07T20:32:53.0277078Z if compiled: 2025-05-07T20:32:53.0277336Z op = torch.compile(op) 2025-05-07T20:32:53.0277634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0277908Z 2025-05-07T20:32:53.0278105Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.0278269Z 2025-05-07T20:32:53.0278373Z moe/activation_test.py:117: 2025-05-07T20:32:53.0278665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0278996Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.0279280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0279834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.0280476Z return fn(*args, **kwargs) 
2025-05-07T20:32:53.0281127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.0281810Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.0282345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.0283021Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.0283748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.0284284Z kernel = self.compile( 2025-05-07T20:32:53.0284821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.0285474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.0285878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0286106Z 2025-05-07T20:32:53.0286317Z self = 2025-05-07T20:32:53.0287429Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.0288787Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a600d60>} 2025-05-07T20:32:53.0290121Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.0291134Z context = 2025-05-07T20:32:53.0291425Z 2025-05-07T20:32:53.0291591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.0292108Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.0292580Z module_map=module_map) 2025-05-07T20:32:53.0292992Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.0293344Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.0293610Z E ^ 2025-05-07T20:32:53.0294171Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.0294616Z 2025-05-07T20:32:53.0295026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
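[Root cause: Triton refuses to lower the fp8e4nv (torch.float8_e4m3fn) dtype when targeting this runner's GPU. The g5.4xlarge instance carries an NVIDIA A10G, compute capability (8, 6), while Triton only enables fp8e4nv on sm_89 and newer; on older parts it offers just fp8e4b15 and fp8e5, exactly as the ValueError lists. A minimal sketch that reproduces the error outside the test suite follows — it assumes a CUDA build of Triton on a pre-sm_89 device, and the kernel and tensor names are illustrative, not taken from FBGEMM:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The offending conversion: fp8e4nv is only legal on sm_89+, so
        # ast_to_ttir raises CompilationError while building the kernel IR.
        y = x.to(tl.float8e4nv)
        tl.store(y_ptr + offs, y, mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)  # fails at compile time on an A10G (sm_86)
]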
[log condensed: Hypothesis went on to retry the same test body ten more times, and every example failed in _fbgemm_silu_mul_quant with the identical CompilationError — ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The reprinted test source and tracebacks are elided; the parameter combinations tried, in order, were:

  T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False
  T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
  T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True
  T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True
  T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
  T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True
  T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False
  T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False]
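[Since every sampled combination of T, D, scale_ub, contiguous, and compiled fails identically, the failure is architecture-bound rather than shape- or mode-dependent. One conventional guard — a sketch only, not how FBGEMM necessarily handles it, and supports_fp8e4nv is a hypothetical helper rather than an FBGEMM API — is to skip FP8 tests on devices below sm_89:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) lowering needs compute
        # capability 8.9 or newer (Ada/Hopper); the A10G reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class SiluMulQuantTest(unittest.TestCase):
        ...
]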
[final Hypothesis example; test body identical to the listing above:]
Trying example: test_silu_mul_quant( T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True, )
2025-05-07T20:32:53.8211559Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.8211721Z 2025-05-07T20:32:53.8211826Z moe/activation_test.py:117: 2025-05-07T20:32:53.8212126Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.8212455Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.8212737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.8213301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.8213941Z return fn(*args, **kwargs)
2025-05-07T20:32:53.8214595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.8215275Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.8215808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.8216478Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.8217135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.8217791Z kernel = self.compile( 2025-05-07T20:32:53.8218324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.8218973Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.8219383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.8219610Z 2025-05-07T20:32:53.8219825Z self = 2025-05-07T20:32:53.8220961Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.8222337Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a04d300>} 2025-05-07T20:32:53.8223674Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.8224723Z context = 2025-05-07T20:32:53.8225008Z 2025-05-07T20:32:53.8225182Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.8225690Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.8226161Z module_map=module_map) 2025-05-07T20:32:53.8226531Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.8226877Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.8227137Z E ^ 2025-05-07T20:32:53.8227600Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)

[identical test source and traceback omitted for each example; both fail at the same point]

    triton.compiler.errors.CompilationError: at 1:0:
    def _fbgemm_silu_mul_quant(
    ^
    ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
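The recurring CompilationError is raised while Triton lowers _fbgemm_silu_mul_quant: the kernel requests the fp8e4nv element type (float8_e4m3fn), which this GPU's backend cannot emit; only fp8e4b15 and fp8e5 are available here. A minimal sketch of a guard a test could use to skip such cases; the helper name is hypothetical, and the compute-capability 8.9 threshold is an assumption based on fp8e4nv being a native type only on newer parts:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (float8_e4m3fn) kernels only compile on
        # devices with compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Possible use at the top of test_silu_mul_quant:
    #     if not supports_fp8e4nv():
    #         raise unittest.SkipTest("fp8e4nv not supported on this GPU")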
The next five examples fail earlier, with torch.OutOfMemoryError while building the bf16 input [identical test source omitted; unique details kept]:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
  moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 320.00 MiB; GPU 0 has 140.44 MiB free of 22.07 GiB; 21.60 GiB allocated by PyTorch, 45.02 MiB reserved but unallocated.
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB; 28.44 MiB free; 21.61 GiB allocated, 141.02 MiB reserved but unallocated.
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): tried to allocate 448.00 MiB; 140.44 MiB free; 21.50 GiB allocated, 141.02 MiB reserved but unallocated.
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB; 28.44 MiB free; 21.67 GiB allocated, 85.02 MiB reserved but unallocated.
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB; 28.44 MiB free; 21.67 GiB allocated, 85.02 MiB reserved but unallocated.

Every message carries the same hint: "If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)".
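The OutOfMemoryError cases are a secondary failure mode: each Hypothesis example allocates fresh bf16 tensors of up to T x 2D = 16384 x 14336 elements (the 448.00 MiB allocation above), and with earlier allocations still cached the 22.07 GiB device runs dry. The error text itself points at PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; a sketch of that plus explicitly releasing cached blocks between examples (neither is applied in this run, and the environment variable must be set before the process first touches CUDA):

    import os

    # Assumption: this line runs before the first CUDA allocation.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cached_cuda_memory() -> None:
        # Hand cached-but-unallocated blocks back to the driver, e.g. in a
        # per-example teardown, so fragmentation does not accumulate.
        torch.cuda.synchronize()
        torch.cuda.empty_cache()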
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)

[identical test source omitted] All three reach the kernel launch and fail with the identical Triton traceback as above (jit.py:330 -> jit.py:623 -> compiler.py:273 -> make_ir), ending in:

    triton.compiler.errors.CompilationError: at 1:0:
    def _fbgemm_silu_mul_quant(
    ^
    ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
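For orientation, the contract the test exercises can be read off its body: silu_mul_quant takes two D-wide halves x0 and x1 plus an optional scale_ub tensor and returns a (y_fp8, y_scale) pair. A pure-PyTorch sketch of the assumed semantics, SiLU(x0) * x1 followed by row-wise fp8e4m3 quantization; the actual Triton kernel's scaling scheme is not visible in this log, so every detail below is illustrative:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: fused SiLU(x0) * x1, then row-wise quantization
        # to float8_e4m3fn ("fp8e4nv" in Triton terms), returning values and
        # per-row scales.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            # Cap the per-row maximum, mirroring the optional scale_ub input.
            row_max = torch.minimum(row_max, scale_ub.float())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale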
The remaining examples alternate between the two failure modes [identical test source omitted; unique details kept]:

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): torch.OutOfMemoryError, tried to allocate 56.00 MiB; 26.44 MiB free of 22.07 GiB; 21.69 GiB allocated by PyTorch, 59.18 MiB reserved but unallocated.
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  CompilationError with the identical Triton traceback as above: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
  moe/activation_test.py:94 (x_sign = torch.sign(x)): torch.OutOfMemoryError, tried to allocate 40.00 MiB; 26.44 MiB free; 21.73 GiB allocated, 19.12 MiB reserved but unallocated.
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
  moe/activation_test.py:92 (the torch.randn allocation): torch.OutOfMemoryError, tried to allocate 320.00 MiB; 26.44 MiB free; 21.73 GiB allocated, 19.12 MiB reserved but unallocated.

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3059207Z 2025-05-07T20:32:54.3059329Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3059548Z 2025-05-07T20:32:54.3059664Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3060081Z self=, 2025-05-07T20:32:54.3060489Z T=2048, 2025-05-07T20:32:54.3060689Z D=5120, 2025-05-07T20:32:54.3060884Z scale_ub=None, 2025-05-07T20:32:54.3061107Z contiguous=False, 2025-05-07T20:32:54.3061394Z compiled=False, 2025-05-07T20:32:54.3061606Z ) 2025-05-07T20:32:54.3061925Z self = 2025-05-07T20:32:54.3062415Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.3062696Z 2025-05-07T20:32:54.3062777Z @given( 2025-05-07T20:32:54.3063016Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3063336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3063643Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3063977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3064315Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3064599Z ) 2025-05-07T20:32:54.3065029Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3065477Z def test_silu_mul_quant( 2025-05-07T20:32:54.3065716Z self, 2025-05-07T20:32:54.3065917Z T: int, 2025-05-07T20:32:54.3066125Z D: int, 2025-05-07T20:32:54.3066346Z scale_ub: Optional[float], 2025-05-07T20:32:54.3066620Z contiguous: bool, 2025-05-07T20:32:54.3066866Z compiled: bool, 2025-05-07T20:32:54.3067087Z ) -> None: 2025-05-07T20:32:54.3067307Z torch.manual_seed(2025) 2025-05-07T20:32:54.3067554Z 2025-05-07T20:32:54.3067827Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3070498Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3072318Z 2025-05-07T20:32:54.3072439Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3072657Z 2025-05-07T20:32:54.3072761Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3073174Z self=, 2025-05-07T20:32:54.3073570Z T=4096, 2025-05-07T20:32:54.3073762Z D=7168, 2025-05-07T20:32:54.3073957Z scale_ub=None, 2025-05-07T20:32:54.3074195Z contiguous=True, 2025-05-07T20:32:54.3074414Z compiled=True, 2025-05-07T20:32:54.3074620Z ) 2025-05-07T20:32:54.3074992Z self = 2025-05-07T20:32:54.3075478Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.3075752Z 2025-05-07T20:32:54.3075836Z @given( 2025-05-07T20:32:54.3076073Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3076390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3076696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3077030Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3077362Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3077646Z ) 2025-05-07T20:32:54.3077998Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3078442Z def test_silu_mul_quant( 2025-05-07T20:32:54.3078685Z self, 2025-05-07T20:32:54.3078885Z T: int, 2025-05-07T20:32:54.3079086Z D: int, 2025-05-07T20:32:54.3079301Z scale_ub: Optional[float], 2025-05-07T20:32:54.3079583Z contiguous: bool, 2025-05-07T20:32:54.3079828Z compiled: bool, 2025-05-07T20:32:54.3080061Z ) -> None: 2025-05-07T20:32:54.3080323Z torch.manual_seed(2025) 2025-05-07T20:32:54.3080576Z 2025-05-07T20:32:54.3080851Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3082893Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
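Note: 21.73 GiB of the 22.07 GiB card is live PyTorch allocation at this point, so the repeated OOMs are a symptom of earlier examples' tensors never being freed, not of fragmentation. One plausible cause (an assumption, not something this log confirms) is that the tracebacks Hypothesis retains for each failing example keep their frames, and therefore the CUDA tensors in them, reachable. A defensive cleanup under that assumption:

    import gc

    import torch

    def reclaim_cuda_memory() -> None:
        # Drop tensors that are only reachable through retained tracebacks...
        gc.collect()
        # ...then hand the now-unused cached blocks back to the driver.
        torch.cuda.empty_cache()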
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3084708Z 2025-05-07T20:32:54.3084830Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3085047Z 2025-05-07T20:32:54.3085153Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3085611Z self=, 2025-05-07T20:32:54.3086010Z T=2048, 2025-05-07T20:32:54.3086202Z D=5120, 2025-05-07T20:32:54.3086402Z scale_ub=1200.0, 2025-05-07T20:32:54.3086636Z contiguous=False, 2025-05-07T20:32:54.3086860Z compiled=False, 2025-05-07T20:32:54.3624888Z ) 2025-05-07T20:32:54.3625252Z self = 2025-05-07T20:32:54.3625774Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.3626053Z 2025-05-07T20:32:54.3626134Z @given( 2025-05-07T20:32:54.3626376Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3626699Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3627006Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3627354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3627693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3627982Z ) 2025-05-07T20:32:54.3628537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3628985Z def test_silu_mul_quant( 2025-05-07T20:32:54.3629239Z self, 2025-05-07T20:32:54.3629437Z T: int, 2025-05-07T20:32:54.3629646Z D: int, 2025-05-07T20:32:54.3629879Z scale_ub: Optional[float], 2025-05-07T20:32:54.3630156Z contiguous: bool, 2025-05-07T20:32:54.3630406Z compiled: bool, 2025-05-07T20:32:54.3630640Z ) -> None: 2025-05-07T20:32:54.3630858Z torch.manual_seed(2025) 2025-05-07T20:32:54.3631104Z 2025-05-07T20:32:54.3631380Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3633383Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3635286Z 2025-05-07T20:32:54.3635406Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3635626Z 2025-05-07T20:32:54.3635732Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3636147Z self=, 2025-05-07T20:32:54.3636555Z T=4096, 2025-05-07T20:32:54.3636742Z D=7168, 2025-05-07T20:32:54.3636942Z scale_ub=1200.0, 2025-05-07T20:32:54.3637169Z contiguous=True, 2025-05-07T20:32:54.3637393Z compiled=False, 2025-05-07T20:32:54.3637606Z ) 2025-05-07T20:32:54.3637929Z self = 2025-05-07T20:32:54.3638431Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.3638713Z 2025-05-07T20:32:54.3638794Z @given( 2025-05-07T20:32:54.3639033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3639424Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3639738Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3640089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3640474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3640761Z ) 2025-05-07T20:32:54.3641117Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3641566Z def test_silu_mul_quant( 2025-05-07T20:32:54.3641809Z self, 2025-05-07T20:32:54.3642050Z T: int, 2025-05-07T20:32:54.3642258Z D: int, 2025-05-07T20:32:54.3642485Z scale_ub: Optional[float], 2025-05-07T20:32:54.3642766Z contiguous: bool, 2025-05-07T20:32:54.3643103Z compiled: bool, 2025-05-07T20:32:54.3643340Z ) -> None: 2025-05-07T20:32:54.3643567Z torch.manual_seed(2025) 2025-05-07T20:32:54.3643809Z 2025-05-07T20:32:54.3644095Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3646103Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3647913Z 2025-05-07T20:32:54.3648043Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3648257Z 2025-05-07T20:32:54.3648369Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3648821Z self=, 2025-05-07T20:32:54.3649229Z T=16384, 2025-05-07T20:32:54.3649430Z D=7168, 2025-05-07T20:32:54.3649622Z scale_ub=None, 2025-05-07T20:32:54.3649845Z contiguous=False, 2025-05-07T20:32:54.3650081Z compiled=True, 2025-05-07T20:32:54.3650283Z ) 2025-05-07T20:32:54.3650610Z self = 2025-05-07T20:32:54.3651111Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.3651388Z 2025-05-07T20:32:54.3651469Z @given( 2025-05-07T20:32:54.3651706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3652025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3652392Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3652728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3653068Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3653362Z ) 2025-05-07T20:32:54.3653846Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3654301Z def test_silu_mul_quant( 2025-05-07T20:32:54.3654550Z self, 2025-05-07T20:32:54.3654749Z T: int, 2025-05-07T20:32:54.3654957Z D: int, 2025-05-07T20:32:54.3655182Z scale_ub: Optional[float], 2025-05-07T20:32:54.3655456Z contiguous: bool, 2025-05-07T20:32:54.3655708Z compiled: bool, 2025-05-07T20:32:54.3655948Z ) -> None: 2025-05-07T20:32:54.3656172Z torch.manual_seed(2025) 2025-05-07T20:32:54.3656422Z 2025-05-07T20:32:54.3656693Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3658701Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3660567Z 2025-05-07T20:32:54.3660687Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3660910Z 2025-05-07T20:32:54.3661015Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3661432Z self=, 2025-05-07T20:32:54.3661830Z T=4096, 2025-05-07T20:32:54.3662029Z D=7168, 2025-05-07T20:32:54.3662230Z scale_ub=None, 2025-05-07T20:32:54.3662447Z contiguous=True, 2025-05-07T20:32:54.3662682Z compiled=False, 2025-05-07T20:32:54.3662897Z ) 2025-05-07T20:32:54.3663261Z self = 2025-05-07T20:32:54.3663761Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.3664035Z 2025-05-07T20:32:54.3664119Z @given( 2025-05-07T20:32:54.3664359Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3664672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3664988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3665324Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3665652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3665948Z ) 2025-05-07T20:32:54.3666303Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3666739Z def test_silu_mul_quant( 2025-05-07T20:32:54.3666990Z self, 2025-05-07T20:32:54.3667194Z T: int, 2025-05-07T20:32:54.3667389Z D: int, 2025-05-07T20:32:54.3667617Z scale_ub: Optional[float], 2025-05-07T20:32:54.3667892Z contiguous: bool, 2025-05-07T20:32:54.3668179Z compiled: bool, 2025-05-07T20:32:54.3668411Z ) -> None: 2025-05-07T20:32:54.3668633Z torch.manual_seed(2025) 2025-05-07T20:32:54.3668884Z 2025-05-07T20:32:54.3669154Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3671154Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3674023Z 2025-05-07T20:32:54.3674148Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3674363Z 2025-05-07T20:32:54.3674475Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3674884Z self=, 2025-05-07T20:32:54.3675294Z T=16384, 2025-05-07T20:32:54.3675496Z D=7168, 2025-05-07T20:32:54.3675697Z scale_ub=None, 2025-05-07T20:32:54.3675915Z contiguous=True, 2025-05-07T20:32:54.3676145Z compiled=False, 2025-05-07T20:32:54.3676356Z ) 2025-05-07T20:32:54.3676673Z self = 2025-05-07T20:32:54.3677170Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.3677445Z 2025-05-07T20:32:54.3677533Z @given( 2025-05-07T20:32:54.3677764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3678085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3678400Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3678734Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3679072Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3679364Z ) 2025-05-07T20:32:54.3679798Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3680240Z def test_silu_mul_quant( 2025-05-07T20:32:54.3680486Z self, 2025-05-07T20:32:54.3680689Z T: int, 2025-05-07T20:32:54.3680887Z D: int, 2025-05-07T20:32:54.3681110Z scale_ub: Optional[float], 2025-05-07T20:32:54.3681391Z contiguous: bool, 2025-05-07T20:32:54.3681633Z compiled: bool, 2025-05-07T20:32:54.3681864Z ) -> None: 2025-05-07T20:32:54.3682085Z torch.manual_seed(2025) 2025-05-07T20:32:54.3682324Z 2025-05-07T20:32:54.3682599Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3684642Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
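Note: the error message's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, targets fragmentation, and here only 19.12 MiB is reserved-but-unallocated, so it is unlikely to rescue this run. For completeness, the setting must be in place before the caching allocator initializes, i.e. before the first CUDA allocation (illustrative sketch):

    import os

    # Must be set before the first tensor is placed on the GPU; the caching
    # allocator reads PYTORCH_CUDA_ALLOC_CONF once, at initialization.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    x = torch.randn(1, device="cuda")  # allocator now uses expandable segments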
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3686458Z 2025-05-07T20:32:54.3686578Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3686790Z 2025-05-07T20:32:54.3686898Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3687306Z self=, 2025-05-07T20:32:54.3687712Z T=16384, 2025-05-07T20:32:54.3687913Z D=7168, 2025-05-07T20:32:54.3688107Z scale_ub=1200.0, 2025-05-07T20:32:54.3688334Z contiguous=True, 2025-05-07T20:32:54.3688563Z compiled=False, 2025-05-07T20:32:54.3688768Z ) 2025-05-07T20:32:54.3689133Z self = 2025-05-07T20:32:54.3689628Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.3689905Z 2025-05-07T20:32:54.3689991Z @given( 2025-05-07T20:32:54.3690222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3690540Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3690852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3691180Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3691513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3691805Z ) 2025-05-07T20:32:54.3692151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3692642Z def test_silu_mul_quant( 2025-05-07T20:32:54.3692889Z self, 2025-05-07T20:32:54.3693086Z T: int, 2025-05-07T20:32:54.3693293Z D: int, 2025-05-07T20:32:54.3693519Z scale_ub: Optional[float], 2025-05-07T20:32:54.3693875Z contiguous: bool, 2025-05-07T20:32:54.3694119Z compiled: bool, 2025-05-07T20:32:54.3694351Z ) -> None: 2025-05-07T20:32:54.3694572Z torch.manual_seed(2025) 2025-05-07T20:32:54.3694812Z 2025-05-07T20:32:54.3695089Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3697090Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3699150Z 2025-05-07T20:32:54.3699284Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5509223Z 2025-05-07T20:32:54.5509558Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5510383Z self=, 2025-05-07T20:32:54.5510806Z T=128, 2025-05-07T20:32:54.5511008Z D=5120, 2025-05-07T20:32:54.5511205Z scale_ub=1200.0, 2025-05-07T20:32:54.5511440Z contiguous=False, 2025-05-07T20:32:54.5511674Z compiled=False, 2025-05-07T20:32:54.5511885Z ) 2025-05-07T20:32:54.5512215Z self = 2025-05-07T20:32:54.5512720Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.5512998Z 2025-05-07T20:32:54.5513088Z @given( 2025-05-07T20:32:54.5513330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5513748Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5514063Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5514409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5514764Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5515072Z ) 2025-05-07T20:32:54.5515461Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5515900Z def test_silu_mul_quant( 2025-05-07T20:32:54.5516147Z self, 2025-05-07T20:32:54.5516347Z T: int, 2025-05-07T20:32:54.5516544Z D: int, 2025-05-07T20:32:54.5516770Z scale_ub: Optional[float], 2025-05-07T20:32:54.5517047Z contiguous: bool, 2025-05-07T20:32:54.5517291Z compiled: bool, 2025-05-07T20:32:54.5517523Z ) -> None: 2025-05-07T20:32:54.5517747Z torch.manual_seed(2025) 2025-05-07T20:32:54.5518001Z 2025-05-07T20:32:54.5518275Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5518630Z 2025-05-07T20:32:54.5518831Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5519216Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5519540Z x = x_sign * x_clamp 2025-05-07T20:32:54.5519790Z x0 = x[:, :D] 2025-05-07T20:32:54.5520009Z x1 = x[:, D:] 2025-05-07T20:32:54.5520230Z 2025-05-07T20:32:54.5520426Z if contiguous: 2025-05-07T20:32:54.5520658Z x0 = x0.contiguous() 2025-05-07T20:32:54.5520926Z x1 = x1.contiguous() 2025-05-07T20:32:54.5521172Z 2025-05-07T20:32:54.5521365Z if scale_ub is not None: 2025-05-07T20:32:54.5521648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5521993Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5522311Z ) 2025-05-07T20:32:54.5522610Z else: 2025-05-07T20:32:54.5522830Z scale_ub_tensor = None 2025-05-07T20:32:54.5523092Z 2025-05-07T20:32:54.5523328Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5523654Z op = silu_mul_quant 2025-05-07T20:32:54.5523910Z if compiled: 2025-05-07T20:32:54.5524159Z op = torch.compile(op) 2025-05-07T20:32:54.5524466Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5524746Z 2025-05-07T20:32:54.5524939Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5525115Z 2025-05-07T20:32:54.5525215Z moe/activation_test.py:117: 2025-05-07T20:32:54.5525527Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5525859Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5526146Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5526843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5527538Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5528075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5528763Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5529429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5530014Z kernel = self.compile( 2025-05-07T20:32:54.5530556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5531214Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5531617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5531846Z 2025-05-07T20:32:54.5532054Z self = 2025-05-07T20:32:54.5533168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5534700Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe5b9e951c0>} 2025-05-07T20:32:54.5536038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5537052Z context = 2025-05-07T20:32:54.5537341Z 2025-05-07T20:32:54.5537508Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5538031Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5538503Z module_map=module_map) 2025-05-07T20:32:54.5538876Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5539274Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5539559Z E ^ 2025-05-07T20:32:54.5540026Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5540472Z 2025-05-07T20:32:54.5540888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5541403Z 2025-05-07T20:32:54.5541507Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5541926Z self=, 2025-05-07T20:32:54.5542336Z T=2048, 2025-05-07T20:32:54.5542526Z D=7168, 2025-05-07T20:32:54.5542729Z scale_ub=None, 2025-05-07T20:32:54.5542958Z contiguous=False, 2025-05-07T20:32:54.5543263Z compiled=False, 2025-05-07T20:32:54.5543479Z ) 2025-05-07T20:32:54.5543812Z self = 2025-05-07T20:32:54.5544304Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.5544587Z 2025-05-07T20:32:54.5544667Z @given( 2025-05-07T20:32:54.5544908Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5545222Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5545540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5554004Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5554353Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5554641Z ) 2025-05-07T20:32:54.5554996Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5555444Z def test_silu_mul_quant( 2025-05-07T20:32:54.5555687Z self, 2025-05-07T20:32:54.5555896Z T: int, 2025-05-07T20:32:54.5556102Z D: int, 2025-05-07T20:32:54.5556322Z scale_ub: Optional[float], 2025-05-07T20:32:54.5556604Z contiguous: bool, 2025-05-07T20:32:54.5556856Z compiled: bool, 2025-05-07T20:32:54.5557082Z ) -> None: 2025-05-07T20:32:54.5557308Z torch.manual_seed(2025) 2025-05-07T20:32:54.5557559Z 2025-05-07T20:32:54.5557913Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5559936Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
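Note: the CompilationError above is independent of the OOMs: Triton gates the fp8e4nv (e4m3) encoding on GPU architecture, and this runner's GPU only offers fp8e4b15 and fp8e5. Assuming the cutoff is compute capability 8.9 (Ada/Hopper), a guard of the following shape would skip rather than fail on older cards (sketch; the class name is hypothetical):

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (e4m3) requires compute capability >= 8.9.
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class Fp8ActivationTests(unittest.TestCase):
        ...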
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5561757Z 2025-05-07T20:32:54.5561921Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5562148Z 2025-05-07T20:32:54.5562255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5562671Z self=, 2025-05-07T20:32:54.5563072Z T=128, 2025-05-07T20:32:54.5563265Z D=7168, 2025-05-07T20:32:54.5563471Z scale_ub=1200.0, 2025-05-07T20:32:54.5563695Z contiguous=True, 2025-05-07T20:32:54.5563920Z compiled=True, 2025-05-07T20:32:54.5564137Z ) 2025-05-07T20:32:54.5564455Z self = 2025-05-07T20:32:54.5564945Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5565220Z 2025-05-07T20:32:54.5565302Z @given( 2025-05-07T20:32:54.5565532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5565847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5566153Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5566480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5566854Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5567144Z ) 2025-05-07T20:32:54.5567484Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5567927Z def test_silu_mul_quant( 2025-05-07T20:32:54.5568169Z self, 2025-05-07T20:32:54.5568366Z T: int, 2025-05-07T20:32:54.5568562Z D: int, 2025-05-07T20:32:54.5568783Z scale_ub: Optional[float], 2025-05-07T20:32:54.5569057Z contiguous: bool, 2025-05-07T20:32:54.5569290Z compiled: bool, 2025-05-07T20:32:54.5569515Z ) -> None: 2025-05-07T20:32:54.5569732Z torch.manual_seed(2025) 2025-05-07T20:32:54.5569968Z 2025-05-07T20:32:54.5570245Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5570669Z 2025-05-07T20:32:54.5570861Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5571158Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5571473Z x = x_sign * x_clamp 2025-05-07T20:32:54.5571707Z x0 = x[:, :D] 2025-05-07T20:32:54.5571926Z x1 = x[:, D:] 2025-05-07T20:32:54.5572136Z 2025-05-07T20:32:54.5572319Z if contiguous: 2025-05-07T20:32:54.5572553Z x0 = x0.contiguous() 2025-05-07T20:32:54.5572813Z x1 = x1.contiguous() 2025-05-07T20:32:54.5573046Z 2025-05-07T20:32:54.5573242Z if scale_ub is not None: 2025-05-07T20:32:54.5573516Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5573955Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5574260Z ) 2025-05-07T20:32:54.5574453Z else: 2025-05-07T20:32:54.5574665Z scale_ub_tensor = None 2025-05-07T20:32:54.5574914Z 2025-05-07T20:32:54.5575150Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5575467Z op = silu_mul_quant 2025-05-07T20:32:54.5575716Z if compiled: 2025-05-07T20:32:54.5575969Z op = torch.compile(op) 2025-05-07T20:32:54.5576268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5576539Z 2025-05-07T20:32:54.5576737Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5576958Z 2025-05-07T20:32:54.5577064Z moe/activation_test.py:117: 2025-05-07T20:32:54.5577362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5577688Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5577970Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5578535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5579089Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5579752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5580488Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5581027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5581697Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5582365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5582897Z kernel = self.compile( 2025-05-07T20:32:54.5583432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5584086Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5584486Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5584712Z 2025-05-07T20:32:54.5584927Z self = 2025-05-07T20:32:54.5586033Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5587387Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe5b9de7b00>} 2025-05-07T20:32:54.5588716Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5589727Z context = 2025-05-07T20:32:54.5590010Z 2025-05-07T20:32:54.5590182Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5590746Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5591219Z module_map=module_map) 2025-05-07T20:32:54.5591589Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5591941Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5592206Z E ^ 2025-05-07T20:32:54.5592670Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5593114Z 2025-05-07T20:32:54.5593532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8328937Z 2025-05-07T20:32:54.8329492Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8330325Z self=, 2025-05-07T20:32:54.8330754Z T=128, 2025-05-07T20:32:54.8330951Z D=7168, 2025-05-07T20:32:54.8331173Z scale_ub=1200.0, 2025-05-07T20:32:54.8331403Z contiguous=True, 2025-05-07T20:32:54.8331630Z compiled=False, 2025-05-07T20:32:54.8331845Z ) 2025-05-07T20:32:54.8332174Z self = 2025-05-07T20:32:54.8332675Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.8333226Z 2025-05-07T20:32:54.8333313Z @given( 2025-05-07T20:32:54.8333542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8333973Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8334283Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8334608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8334937Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8335228Z ) 2025-05-07T20:32:54.8335573Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8336018Z def test_silu_mul_quant( 2025-05-07T20:32:54.8336264Z self, 2025-05-07T20:32:54.8336462Z T: int, 2025-05-07T20:32:54.8336657Z D: int, 2025-05-07T20:32:54.8336977Z scale_ub: Optional[float], 2025-05-07T20:32:54.8337257Z contiguous: bool, 2025-05-07T20:32:54.8337496Z compiled: bool, 2025-05-07T20:32:54.8337730Z ) -> None: 2025-05-07T20:32:54.8337958Z torch.manual_seed(2025) 2025-05-07T20:32:54.8338199Z 2025-05-07T20:32:54.8338477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8338826Z 2025-05-07T20:32:54.8339018Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8339316Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8341368Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
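Note: the compiled=True variant above fails at the same spot as the eager one; torch.compile's eval_frame wrapper falls through to the identical _fbgemm_silu_mul_quant launch, so compilation mode does not change the outcome. A minimal illustration of that wrapping (silu_mul here is a stand-in, not the FBGEMM op):

    import torch

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Stand-in for the op under test: SiLU(x0) * x1.
        return x0 * torch.sigmoid(x0) * x1

    eager_op = silu_mul
    # For the FBGEMM op in this log, the compiled wrapper falls through to
    # the same Triton kernel, so it hits the same architecture check.
    compiled_op = torch.compile(silu_mul)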
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.8343204Z 2025-05-07T20:32:54.8343326Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.8343541Z 2025-05-07T20:32:54.8343653Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8344062Z self=, 2025-05-07T20:32:54.8344469Z T=128, 2025-05-07T20:32:54.8344663Z D=5120, 2025-05-07T20:32:54.8344859Z scale_ub=1200.0, 2025-05-07T20:32:54.8345087Z contiguous=True, 2025-05-07T20:32:54.8345315Z compiled=True, 2025-05-07T20:32:54.8345519Z ) 2025-05-07T20:32:54.8345837Z self = 2025-05-07T20:32:54.8346398Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.8346660Z 2025-05-07T20:32:54.8346746Z @given( 2025-05-07T20:32:54.8346978Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8347296Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8347604Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8347931Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8348263Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8348552Z ) 2025-05-07T20:32:54.8348895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8349338Z def test_silu_mul_quant( 2025-05-07T20:32:54.8349582Z self, 2025-05-07T20:32:54.8349776Z T: int, 2025-05-07T20:32:54.8349977Z D: int, 2025-05-07T20:32:54.8350201Z scale_ub: Optional[float], 2025-05-07T20:32:54.8350472Z contiguous: bool, 2025-05-07T20:32:54.8350718Z compiled: bool, 2025-05-07T20:32:54.8350945Z ) -> None: 2025-05-07T20:32:54.8351165Z torch.manual_seed(2025) 2025-05-07T20:32:54.8351407Z 2025-05-07T20:32:54.8351683Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8352029Z 2025-05-07T20:32:54.8352223Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8352518Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8354517Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.8356324Z 2025-05-07T20:32:54.8356452Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.8356664Z 2025-05-07T20:32:54.8356810Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8357241Z self=, 2025-05-07T20:32:54.8357640Z T=128, 2025-05-07T20:32:54.8357839Z D=7168, 2025-05-07T20:32:54.8358038Z scale_ub=None, 2025-05-07T20:32:54.8358252Z contiguous=True, 2025-05-07T20:32:54.8358480Z compiled=True, 2025-05-07T20:32:54.8358688Z ) 2025-05-07T20:32:54.8359012Z self = 2025-05-07T20:32:54.8359492Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.8359852Z 2025-05-07T20:32:54.8359961Z @given( 2025-05-07T20:32:54.8360253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8360598Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8360943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8361328Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8361711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8362101Z ) 2025-05-07T20:32:54.8362495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8363013Z def test_silu_mul_quant( 2025-05-07T20:32:54.8363256Z self, 2025-05-07T20:32:54.8363498Z T: int, 2025-05-07T20:32:54.8363700Z D: int, 2025-05-07T20:32:54.8363959Z scale_ub: Optional[float], 2025-05-07T20:32:54.8364247Z contiguous: bool, 2025-05-07T20:32:54.8364531Z compiled: bool, 2025-05-07T20:32:54.8364791Z ) -> None: 2025-05-07T20:32:54.8365016Z torch.manual_seed(2025) 2025-05-07T20:32:54.8365274Z 2025-05-07T20:32:54.8365569Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8367892Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.8370098Z 2025-05-07T20:32:54.8370225Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.8370489Z 2025-05-07T20:32:54.8370910Z FAILED 2025-05-07T20:32:54.8371046Z 2025-05-07T20:32:54.8371182Z =================================== FAILURES =================================== 2025-05-07T20:32:54.8371649Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:54.8372143Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:54.8372821Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:54.8373421Z | yield 2025-05-07T20:32:54.8374034Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:54.8374686Z | self._callTestMethod(testMethod) 2025-05-07T20:32:54.8375029Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:54.8375607Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:54.8376205Z | if method() is not None: 2025-05-07T20:32:54.8376507Z | ~~~~~~^^ 2025-05-07T20:32:54.8377178Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:54.8378002Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8378351Z | ^^^^^^^ 2025-05-07T20:32:54.8379018Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:54.8379710Z | raise the_error_hypothesis_found 2025-05-07T20:32:54.8380196Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:54.8380713Z +-+---------------- 1 ---------------- 2025-05-07T20:32:54.8381010Z | Traceback (most recent call last): 2025-05-07T20:32:54.8381872Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:54.8383105Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8386813Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.8390190Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.8390869Z | self=, 2025-05-07T20:32:54.8391525Z | T=2048, 2025-05-07T20:32:54.8391919Z | D=5120, # or any other generated value 2025-05-07T20:32:54.8392425Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:54.8392973Z | contiguous=True, # or any other generated value 2025-05-07T20:32:54.8393525Z | compiled=False, # or any other generated value 2025-05-07T20:32:54.8394070Z | ) 2025-05-07T20:32:54.8394319Z | 2025-05-07T20:32:54.8395114Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:54.8395931Z +---------------- 2 ---------------- 2025-05-07T20:32:54.8396217Z | Traceback (most recent call last): 2025-05-07T20:32:54.8396923Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:54.8397696Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8400003Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.8401959Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.8402390Z | self=, 2025-05-07T20:32:54.8402925Z | T=128, 2025-05-07T20:32:54.8403133Z | D=7168, 2025-05-07T20:32:54.8403350Z | scale_ub=None, 2025-05-07T20:32:54.8403579Z | contiguous=True, 2025-05-07T20:32:54.8403820Z | compiled=True, 2025-05-07T20:32:54.8404041Z | ) 2025-05-07T20:32:54.8404212Z | 2025-05-07T20:32:54.8404732Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:54.8405335Z +---------------- 3 ---------------- 2025-05-07T20:32:54.8405619Z | Traceback (most recent call last): 2025-05-07T20:32:54.8406530Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:54.8407319Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8409320Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
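Note: each "You can reproduce this example" line is directly usable; Hypothesis replays the encoded example instead of generating new ones. For failure 1, with the decorator arguments copied verbatim from the report (sketch; written as a free function rather than the original test method):

    from typing import Optional

    import hypothesis.strategies as st
    from hypothesis import Verbosity, given, reproduce_failure, settings

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(
        T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
    ) -> None:
        ...  # body unchanged; remove @reproduce_failure once the failure is fixed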
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.8411300Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.8411732Z | self=, 2025-05-07T20:32:54.8412145Z | T=128, 2025-05-07T20:32:54.8412349Z | D=5120, 2025-05-07T20:32:54.8412552Z | scale_ub=1200.0, 2025-05-07T20:32:54.8412873Z | contiguous=True, 2025-05-07T20:32:54.8413118Z | compiled=True, 2025-05-07T20:32:54.8413336Z | ) 2025-05-07T20:32:54.8413521Z | 2025-05-07T20:32:54.8414168Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:54.8414764Z +---------------- 4 ---------------- 2025-05-07T20:32:54.8415058Z | Traceback (most recent call last): 2025-05-07T20:32:54.8415761Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:54.8416467Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.8416825Z | ~~~~~~^^ 2025-05-07T20:32:54.8417466Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:54.8418167Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.8418992Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:54.8419774Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.8420060Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:32:54.8420324Z | a, 2025-05-07T20:32:54.8420521Z | ^^ 2025-05-07T20:32:54.8420723Z | ...<23 lines>... 
2025-05-07T20:32:54.8420963Z | USE_INT64=use_int64, 2025-05-07T20:32:54.8421215Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8421459Z | ) 2025-05-07T20:32:54.8421644Z | ^ 2025-05-07T20:32:54.8422161Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:54.8422883Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8423326Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8424024Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:54.8424785Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.8425250Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8425883Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:54.8426570Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.8426972Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8427887Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:54.8428687Z | fn() 2025-05-07T20:32:54.8428963Z | ~~^^ 2025-05-07T20:32:54.8429746Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:54.8430663Z | self.fn.run( 2025-05-07T20:32:54.8430970Z | ~~~~~~~~~~~^ 2025-05-07T20:32:54.8431265Z | *args, 2025-05-07T20:32:54.8431563Z | ^^^^^^ 2025-05-07T20:32:54.8431860Z | **current, 2025-05-07T20:32:54.8432170Z | ^^^^^^^^^^ 2025-05-07T20:32:54.8432480Z | ) 2025-05-07T20:32:54.8432741Z | ^ 2025-05-07T20:32:54.8433417Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:54.8434224Z | kernel = self.compile( 2025-05-07T20:32:54.8434591Z | src, 2025-05-07T20:32:54.8434905Z | target=target, 2025-05-07T20:32:54.8435350Z | options=options.__dict__, 2025-05-07T20:32:54.8435736Z | ) 2025-05-07T20:32:54.8436486Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:54.8437469Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8438479Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:54.8439566Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8440249Z | module_map=module_map) 2025-05-07T20:32:54.8440856Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8441343Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.8441714Z | ^ 2025-05-07T20:32:54.8442355Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8443150Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.8443735Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:32:54.8444435Z | self=,
2025-05-07T20:32:54.8445041Z | T=1,  # or any other generated value
2025-05-07T20:32:54.8467552Z | D=5120,  # or any other generated value
2025-05-07T20:32:54.8468104Z | scale_ub=None,  # or any other generated value
2025-05-07T20:32:54.8468595Z | contiguous=True,  # or any other generated value
2025-05-07T20:32:54.8469095Z | compiled=True,  # or any other generated value
2025-05-07T20:32:54.8469516Z | )
2025-05-07T20:32:54.8469761Z |
2025-05-07T20:32:54.8470566Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
2025-05-07T20:32:54.8471431Z +------------------------------------
2025-05-07T20:32:54.8471917Z ---------------------------------- Hypothesis ----------------------------------
2025-05-07T20:32:54.8472596Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8473147Z     self=,
2025-05-07T20:32:54.8473682Z     T=1,
2025-05-07T20:32:54.8473921Z     D=5120,
2025-05-07T20:32:54.8474177Z     scale_ub=None,
2025-05-07T20:32:54.8474464Z     contiguous=True,
2025-05-07T20:32:54.8474754Z     compiled=True,
2025-05-07T20:32:54.8475036Z )
2025-05-07T20:32:54.8475475Z self = 
2025-05-07T20:32:54.8476091Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:54.8476449Z 
2025-05-07T20:32:54.8476557Z     @given(
2025-05-07T20:32:54.8476951Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.8477377Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.8477805Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.8478265Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.8478723Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.8479114Z     )
2025-05-07T20:32:54.8479595Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.8480201Z     def test_silu_mul_quant(
2025-05-07T20:32:54.8480529Z         self,
2025-05-07T20:32:54.8480811Z         T: int,
2025-05-07T20:32:54.8481091Z         D: int,
2025-05-07T20:32:54.8481389Z         scale_ub: Optional[float],
2025-05-07T20:32:54.8481773Z         contiguous: bool,
2025-05-07T20:32:54.8482101Z         compiled: bool,
2025-05-07T20:32:54.8482391Z     ) -> None:
2025-05-07T20:32:54.8482680Z         torch.manual_seed(2025)
2025-05-07T20:32:54.8483010Z 
2025-05-07T20:32:54.8483372Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.8483919Z 
2025-05-07T20:32:54.8484175Z         x_sign = torch.sign(x)
2025-05-07T20:32:54.8484567Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:54.8484999Z         x = x_sign * x_clamp
2025-05-07T20:32:54.8485326Z         x0 = x[:, :D]
2025-05-07T20:32:54.8485630Z         x1 = x[:, D:]
2025-05-07T20:32:54.8485913Z 
2025-05-07T20:32:54.8486170Z         if contiguous:
2025-05-07T20:32:54.8486494Z             x0 = x0.contiguous()
2025-05-07T20:32:54.8486847Z             x1 = x1.contiguous()
2025-05-07T20:32:54.8487172Z 
2025-05-07T20:32:54.8487441Z         if scale_ub is not None:
2025-05-07T20:32:54.8487818Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:54.8488267Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:54.8488751Z             )
2025-05-07T20:32:54.8489018Z         else:
2025-05-07T20:32:54.8489301Z             scale_ub_tensor = None
2025-05-07T20:32:54.8489646Z 
2025-05-07T20:32:54.8489964Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.8490387Z             op = silu_mul_quant
2025-05-07T20:32:54.8490730Z             if compiled:
2025-05-07T20:32:54.8491070Z                 op = torch.compile(op)
2025-05-07T20:32:54.8491467Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.8491844Z 
2025-05-07T20:32:54.8492109Z         y_fp8, y_scale = fn()
2025-05-07T20:32:54.8492490Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:54.8492884Z 
2025-05-07T20:32:54.8493206Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.8493787Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:54.8494183Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:54.8494612Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:54.8495106Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.8495522Z 
2025-05-07T20:32:54.8495804Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:54.8496078Z 
2025-05-07T20:32:54.8496223Z moe/activation_test.py:126: 
2025-05-07T20:32:54.8496629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.8497141Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:54.8497598Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.8499216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:54.8500255Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:54.8501059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:54.8502012Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.8503207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:54.8504192Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:54.8505173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:54.8506044Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:54.8506854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:54.8507558Z     fn()
2025-05-07T20:32:54.8508246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:54.8509039Z     self.fn.run(
2025-05-07T20:32:54.8509674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.8510398Z     kernel = self.compile(
2025-05-07T20:32:54.8511227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.8512112Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.8512651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.8512980Z 
2025-05-07T20:32:54.8513268Z self = 
2025-05-07T20:32:54.8514801Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.8516693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c614a700>}
2025-05-07T20:32:54.8518613Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:54.8519993Z context = 
2025-05-07T20:32:54.8520384Z 
2025-05-07T20:32:54.8520619Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.8521349Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.8521983Z                            module_map=module_map)
2025-05-07T20:32:54.8522475Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8522960Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:54.8523327Z E       ^
2025-05-07T20:32:54.8523956Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8524573Z 
2025-05-07T20:32:54.8525159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8525883Z 
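The reproduce hint in the falsifying-example box above can be applied directly. A minimal sketch, assuming Hypothesis 6.131.14 and the same decorator stack shown in the listing (st, _MAX_SAMPLES, and the test body come from the test module; this is not a new test):

    from hypothesis import given, reproduce_failure, settings, Verbosity
    import hypothesis.strategies as st

    @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')  # blob copied verbatim from the log above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # body unchanged; defined on the test class as in the listing above

The decorator is meant to be temporary: it replays exactly that one example, and Hypothesis raises DidNotReproduce once the replay no longer triggers the failure.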
2025-05-07T20:32:54.8526034Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8526609Z     self=,
2025-05-07T20:32:54.8527223Z     T=2048,
2025-05-07T20:32:54.8527468Z     D=5120,
2025-05-07T20:32:54.8527713Z     scale_ub=1200.0,
2025-05-07T20:32:54.8528005Z     contiguous=True,
2025-05-07T20:32:54.8528304Z     compiled=False,
2025-05-07T20:32:54.8528576Z )
2025-05-07T20:32:54.8528981Z self = 
2025-05-07T20:32:54.8529634Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:54.8529986Z 
2025-05-07T20:32:54.8536967Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:54.8536973Z 
2025-05-07T20:32:54.8537115Z moe/activation_test.py:117: 
2025-05-07T20:32:54.8537296Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.8537436Z moe/activation_test.py:115: in fn
2025-05-07T20:32:54.8537583Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.8538267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:54.8538411Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:54.8538922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:54.8539234Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.8539707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.8539898Z     kernel = self.compile(
2025-05-07T20:32:54.8540478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.8540726Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.8540905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.8540912Z 
2025-05-07T20:32:54.8541207Z self = 
2025-05-07T20:32:54.8542327Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.8543029Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c5ffa020>}
2025-05-07T20:32:54.8544049Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:54.8544309Z context = 
2025-05-07T20:32:54.8544316Z 
2025-05-07T20:32:54.8544548Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.8544904Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.8545056Z                            module_map=module_map)
2025-05-07T20:32:54.8545280Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8545420Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.8545534Z E       ^
2025-05-07T20:32:54.8546071Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8546081Z 
2025-05-07T20:32:54.8546638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8546644Z 
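Both failure modes above bottom out in the same ValueError: this GPU's Triton backend exposes only fp8e4b15 and fp8e5, while both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row request fp8e4nv (FP8 E4M3). Triton generally gates fp8e4nv on newer GPUs (compute capability 8.9 and up, Ada/Hopper); treat that exact threshold as an assumption here. A minimal guard sketch that would skip rather than fail such cases (the helper name and marker are hypothetical, not FBGEMM API):

    import pytest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumed threshold: Triton's fp8e4nv requires compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Reusable marker for all FP8 row-quantization tests in this file.
    requires_fp8 = pytest.mark.skipif(
        not _supports_fp8e4nv(),
        reason="Triton fp8e4nv (E4M3) unavailable on this GPU",
    )

Applying requires_fp8 to test_silu_mul_quant would turn the repeated CompilationErrors below into skips on hardware without E4M3 support.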
2025-05-07T20:32:54.8546793Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8547070Z     self=,
2025-05-07T20:32:54.8547179Z     T=2048,
2025-05-07T20:32:54.8547283Z     D=5120,
2025-05-07T20:32:54.8547395Z     scale_ub=1200.0,
2025-05-07T20:32:54.8547512Z     contiguous=True,
2025-05-07T20:32:54.8547619Z     compiled=True,
2025-05-07T20:32:54.8547764Z )
2025-05-07T20:32:54.8555929Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:54.8556077Z moe/activation_test.py:126: 
2025-05-07T20:32:54.8567115Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8567251Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:54.8567343Z E       ^
2025-05-07T20:32:54.8567823Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8568539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8568549Z 
2025-05-07T20:32:54.8568690Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8569000Z     self=,
2025-05-07T20:32:54.8569106Z     T=16384,
2025-05-07T20:32:54.8569210Z     D=7168,
2025-05-07T20:32:54.8569334Z     scale_ub=1200.0,
2025-05-07T20:32:54.8569454Z     contiguous=False,
2025-05-07T20:32:54.8569570Z     compiled=False,
2025-05-07T20:32:54.8569680Z )
2025-05-07T20:32:54.8574967Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:54.8575075Z moe/activation_test.py:117: 
2025-05-07T20:32:54.8580969Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8581066Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.8581181Z E       ^
2025-05-07T20:32:54.8581540Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8581951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8581955Z 
2025-05-07T20:32:54.8582065Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8582283Z     self=,
2025-05-07T20:32:54.8582360Z     T=1,
2025-05-07T20:32:54.8582448Z     D=7168,
2025-05-07T20:32:54.8582528Z     scale_ub=None,
2025-05-07T20:32:54.8582611Z     contiguous=True,
2025-05-07T20:32:54.8582736Z     compiled=True,
2025-05-07T20:32:54.8582809Z )
2025-05-07T20:32:54.8588549Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:54.8588649Z moe/activation_test.py:126: 
2025-05-07T20:32:54.8596788Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8596894Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:54.8596975Z E       ^
2025-05-07T20:32:54.8597331Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8597782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8597787Z 
2025-05-07T20:32:54.8597896Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8598116Z     self=,
2025-05-07T20:32:54.8598501Z     T=4096,
2025-05-07T20:32:54.8598629Z     D=5120,
2025-05-07T20:32:54.8598715Z     scale_ub=None,
2025-05-07T20:32:54.8598802Z     contiguous=False,
2025-05-07T20:32:54.8598899Z     compiled=False,
2025-05-07T20:32:54.8598972Z )
2025-05-07T20:32:54.8603999Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:54.8604096Z moe/activation_test.py:117: 
2025-05-07T20:32:54.8610006Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8610102Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.8610187Z E       ^
2025-05-07T20:32:54.8610583Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8611004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8611009Z 
2025-05-07T20:32:54.8611109Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8611371Z     self=,
2025-05-07T20:32:54.8611458Z     T=4096,
2025-05-07T20:32:54.8611533Z     D=7168,
2025-05-07T20:32:54.8611627Z     scale_ub=None,
2025-05-07T20:32:54.8611711Z     contiguous=False,
2025-05-07T20:32:54.8611790Z     compiled=False,
2025-05-07T20:32:54.8611870Z )
2025-05-07T20:32:54.8616826Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:54.8616970Z moe/activation_test.py:117: 
2025-05-07T20:32:54.8622768Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8622867Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.8622945Z E       ^
2025-05-07T20:32:54.8623298Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8623749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8623754Z 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8623305Z 2025-05-07T20:32:54.8623749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8623754Z 2025-05-07T20:32:54.8623868Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8624088Z self=, 2025-05-07T20:32:54.8624164Z T=128, 2025-05-07T20:32:54.8624247Z D=7168, 2025-05-07T20:32:54.8624331Z scale_ub=None, 2025-05-07T20:32:54.8624415Z contiguous=False, 2025-05-07T20:32:54.8624508Z compiled=True, 2025-05-07T20:32:54.8624597Z ) 2025-05-07T20:32:54.8632976Z self = 2025-05-07T20:32:54.8633186Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.8633192Z 2025-05-07T20:32:54.8633275Z @given( 2025-05-07T20:32:54.8633411Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8633513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8633635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8633862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8633982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8634062Z ) 2025-05-07T20:32:54.8634318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8634418Z def test_silu_mul_quant( 2025-05-07T20:32:54.8634497Z self, 2025-05-07T20:32:54.8634582Z T: int, 2025-05-07T20:32:54.8634658Z D: int, 2025-05-07T20:32:54.8634766Z scale_ub: Optional[float], 2025-05-07T20:32:54.8634855Z contiguous: bool, 2025-05-07T20:32:54.8634940Z compiled: bool, 2025-05-07T20:32:54.8635025Z ) -> None: 2025-05-07T20:32:54.8635120Z torch.manual_seed(2025) 2025-05-07T20:32:54.8635239Z 2025-05-07T20:32:54.8635415Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8635488Z 2025-05-07T20:32:54.8635584Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8635720Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8635809Z x = x_sign * x_clamp 2025-05-07T20:32:54.8635890Z x0 = x[:, :D] 2025-05-07T20:32:54.8635979Z x1 = x[:, D:] 2025-05-07T20:32:54.8636054Z 2025-05-07T20:32:54.8636139Z if contiguous: 2025-05-07T20:32:54.8636239Z x0 = x0.contiguous() 2025-05-07T20:32:54.8636328Z x1 = x1.contiguous() 2025-05-07T20:32:54.8636409Z 2025-05-07T20:32:54.8636502Z if scale_ub is not None: 2025-05-07T20:32:54.8636610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8636756Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8636833Z ) 2025-05-07T20:32:54.8636911Z else: 2025-05-07T20:32:54.8637021Z scale_ub_tensor = None 2025-05-07T20:32:54.8637097Z 2025-05-07T20:32:54.8637231Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8637333Z op = silu_mul_quant 2025-05-07T20:32:54.8637422Z if compiled: 2025-05-07T20:32:54.8637524Z op = torch.compile(op) 2025-05-07T20:32:54.8637641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8637762Z 2025-05-07T20:32:54.8637863Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.8637986Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.8638062Z 2025-05-07T20:32:54.8638206Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8638312Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.8638412Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.8638543Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.8638683Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.8638761Z 2025-05-07T20:32:54.8638869Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:54.8638873Z 2025-05-07T20:32:54.8639012Z moe/activation_test.py:126: 2025-05-07T20:32:54.8639157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8639264Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.8639401Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.8639963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.8640065Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.8640424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8640655Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8641025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.8641293Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.8641702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.8641873Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.8642219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.8642300Z fn() 2025-05-07T20:32:54.8642703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.8642792Z self.fn.run( 2025-05-07T20:32:54.8643130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8643235Z kernel = self.compile( 2025-05-07T20:32:54.8643651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8643831Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8643971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8643976Z 2025-05-07T20:32:54.8644182Z self = 2025-05-07T20:32:54.8644959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8645461Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c4683a60>} 2025-05-07T20:32:54.8646209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8646405Z context = 2025-05-07T20:32:54.8646410Z 2025-05-07T20:32:54.8646574Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8646883Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8646993Z module_map=module_map) 2025-05-07T20:32:54.8647155Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8647265Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.8647345Z E ^ 2025-05-07T20:32:54.8647696Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8647701Z 2025-05-07T20:32:54.8648108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8648113Z 2025-05-07T20:32:54.8648263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8648487Z self=, 2025-05-07T20:32:54.8648574Z T=128, 2025-05-07T20:32:54.8648654Z D=7168, 2025-05-07T20:32:54.8648742Z scale_ub=None, 2025-05-07T20:32:54.8648836Z contiguous=False, 2025-05-07T20:32:54.8648921Z compiled=False, 2025-05-07T20:32:54.8648999Z ) 2025-05-07T20:32:54.8649219Z self = 2025-05-07T20:32:54.8649391Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.8649395Z 2025-05-07T20:32:54.8649474Z @given( 2025-05-07T20:32:54.8649601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8649702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8649822Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8649945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8650061Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8650187Z ) 2025-05-07T20:32:54.8650432Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8650529Z def test_silu_mul_quant( 2025-05-07T20:32:54.8650614Z self, 2025-05-07T20:32:54.8650693Z T: int, 2025-05-07T20:32:54.8650770Z D: int, 2025-05-07T20:32:54.8650873Z scale_ub: Optional[float], 2025-05-07T20:32:54.8650962Z contiguous: bool, 2025-05-07T20:32:54.8651047Z compiled: bool, 2025-05-07T20:32:54.8651134Z ) -> None: 2025-05-07T20:32:54.8651230Z torch.manual_seed(2025) 2025-05-07T20:32:54.8651303Z 2025-05-07T20:32:54.8651477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8651551Z 2025-05-07T20:32:54.8651693Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8651817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8651909Z x = x_sign * x_clamp 2025-05-07T20:32:54.8651997Z x0 = x[:, :D] 2025-05-07T20:32:54.8652077Z x1 = x[:, D:] 2025-05-07T20:32:54.8652150Z 2025-05-07T20:32:54.8652238Z if contiguous: 2025-05-07T20:32:54.8652330Z x0 = x0.contiguous() 2025-05-07T20:32:54.8652418Z x1 = x1.contiguous() 2025-05-07T20:32:54.8652498Z 2025-05-07T20:32:54.8652590Z if scale_ub is not None: 2025-05-07T20:32:54.8652693Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8652832Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8652908Z ) 2025-05-07T20:32:54.8652990Z else: 2025-05-07T20:32:54.8653084Z scale_ub_tensor = None 2025-05-07T20:32:54.8653156Z 2025-05-07T20:32:54.8653288Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8653381Z op = silu_mul_quant 2025-05-07T20:32:54.8653466Z if compiled: 2025-05-07T20:32:54.8653572Z op = torch.compile(op) 2025-05-07T20:32:54.8653804Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8653883Z 2025-05-07T20:32:54.8653976Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8653980Z 2025-05-07T20:32:54.8654127Z moe/activation_test.py:117: 2025-05-07T20:32:54.8654253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8654357Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8654454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8654946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8655040Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8655394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8655619Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8655998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8656097Z kernel = self.compile( 2025-05-07T20:32:54.8656474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8656647Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8656779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8656783Z 2025-05-07T20:32:54.8656982Z self = 2025-05-07T20:32:54.8657748Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8658286Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c4105e40>} 2025-05-07T20:32:54.8659014Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8659209Z context = 2025-05-07T20:32:54.8659214Z 2025-05-07T20:32:54.8659373Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8659631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8659740Z module_map=module_map) 2025-05-07T20:32:54.8659899Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8660045Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8660124Z E ^ 2025-05-07T20:32:54.8660522Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8660532Z 2025-05-07T20:32:54.8660935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8660941Z 2025-05-07T20:32:54.8661044Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8661264Z self=, 2025-05-07T20:32:54.8661341Z T=4096, 2025-05-07T20:32:54.8661417Z D=5120, 2025-05-07T20:32:54.8661508Z scale_ub=1200.0, 2025-05-07T20:32:54.8661589Z contiguous=True, 2025-05-07T20:32:54.8661671Z compiled=False, 2025-05-07T20:32:54.8661747Z ) 2025-05-07T20:32:54.8661962Z self = 2025-05-07T20:32:54.8662139Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.8662144Z 2025-05-07T20:32:54.8662222Z @given( 2025-05-07T20:32:54.8662341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8662448Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8662562Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8662721Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8662839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8662914Z ) 2025-05-07T20:32:54.8663152Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8663250Z def test_silu_mul_quant( 2025-05-07T20:32:54.8663325Z self, 2025-05-07T20:32:54.8663406Z T: int, 2025-05-07T20:32:54.8663481Z D: int, 2025-05-07T20:32:54.8663576Z scale_ub: Optional[float], 2025-05-07T20:32:54.8663667Z contiguous: bool, 2025-05-07T20:32:54.8663754Z compiled: bool, 2025-05-07T20:32:54.8663829Z ) -> None: 2025-05-07T20:32:54.8663993Z torch.manual_seed(2025) 2025-05-07T20:32:54.8664065Z 2025-05-07T20:32:54.8664232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8664309Z 2025-05-07T20:32:54.8664399Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8664524Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8664615Z x = x_sign * x_clamp 2025-05-07T20:32:54.8664692Z x0 = x[:, :D] 2025-05-07T20:32:54.8664775Z x1 = x[:, D:] 2025-05-07T20:32:54.8664847Z 2025-05-07T20:32:54.8664929Z if contiguous: 2025-05-07T20:32:54.8665022Z x0 = x0.contiguous() 2025-05-07T20:32:54.8665109Z x1 = x1.contiguous() 2025-05-07T20:32:54.8665180Z 2025-05-07T20:32:54.8665276Z if scale_ub is not None: 2025-05-07T20:32:54.8665376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8665509Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8665590Z ) 2025-05-07T20:32:54.8665665Z else: 2025-05-07T20:32:54.8665758Z scale_ub_tensor = None 2025-05-07T20:32:54.8665880Z 2025-05-07T20:32:54.8666007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8666095Z op = silu_mul_quant 2025-05-07T20:32:54.8666187Z if compiled: 2025-05-07T20:32:54.8666282Z op = torch.compile(op) 2025-05-07T20:32:54.8666389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8666458Z 2025-05-07T20:32:54.8666547Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8666552Z 2025-05-07T20:32:54.8666653Z moe/activation_test.py:117: 2025-05-07T20:32:54.8666777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8666873Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8666974Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8667507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8667611Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8667963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8668181Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8668519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8668612Z kernel = self.compile( 2025-05-07T20:32:54.8668990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8669166Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8669288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8669295Z 2025-05-07T20:32:54.8669500Z self = 2025-05-07T20:32:54.8670264Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8670804Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c41068e0>} 2025-05-07T20:32:54.8671537Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8671724Z context = 2025-05-07T20:32:54.8671729Z 2025-05-07T20:32:54.8671899Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8672193Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8672310Z module_map=module_map) 2025-05-07T20:32:54.8672469Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8672569Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8672650Z E ^ 2025-05-07T20:32:54.8672994Z E ValueError("type fp8e4nv not supported in this architecture. 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
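Every example Hypothesis draws below fails the same way, because the error is architecture-dependent rather than shape-dependent: Triton's fp8e4nv type (torch.float8_e4m3fn) has no codegen support on an A10G (SM 8.6); it requires SM 8.9+ such as Ada or Hopper. A guard along the following lines, checked before the test body runs, would skip cleanly on such GPUs; supports_fp8e4nv and the class name are hypothetical, a sketch rather than FBGEMM's actual gating:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to torch.float8_e4m3fn, whose Triton codegen
        # needs SM 8.9+ (L4/Ada, H100/Hopper). An A10G is SM 8.6, so this
        # returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv (float8_e4m3fn) requires SM 8.9+")
    class ActivationFp8Tests(unittest.TestCase):
        ...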
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True): same CompilationError in _kernel_quantize_fp8_row via ref_fn() at moe/activation_test.py:126.
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True): same failure.
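The two fp8 formats the ValueError lists as supported map onto different front-end types: fp8e5 corresponds to torch.float8_e5m2, while fp8e4b15 is a Triton-internal bias-15 e4m3 variant with no torch dtype; fp8e4nv is torch.float8_e4m3fn. Reading the message that way suggests a capability-based dtype choice (a sketch; the mapping is our interpretation of the error, not FBGEMM policy):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # fp8e4nv <-> torch.float8_e4m3fn : needs SM 8.9+ in Triton codegen.
        # fp8e5   <-> torch.float8_e5m2   : accepted on SM 8.6 per the error above.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2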
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True): same failure.
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True): same failure.
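The reference path fails as well because triton_quantize_fp8_row itself launches the autotuned _kernel_quantize_fp8_row Triton kernel. A pure-PyTorch stand-in for the row-wise scheme implied by the test's dequantization step (y_fp8.to(torch.float32) * y_scale[:, None]) would sidestep Triton entirely; the helper name, the reading of scale_ub as a clamp on the row maximum, and the zero-row epsilon are our assumptions, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs, optionally clamped to the upper bound (assumption).
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        # Dequantization scale, chosen so y ~= y_fp8.to(torch.float32) * scale[:, None].
        scale = row_max / torch.finfo(torch.float8_e4m3fn).max  # e4m3fn max = 448.0
        scale = torch.clamp(scale, min=1e-12)  # guard all-zero rows
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The final cast goes through PyTorch rather than Triton, so it should not hit the same fp8e4nv restriction on this GPU.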
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

Same test body as above; this draw fails earlier, at the compiled fn() call:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
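This draw is the first to die on the forward path itself rather than in the reference: the Dynamo frame (_dynamo/eval_frame.py) simply re-enters silu_mul_quant, which launches _fbgemm_silu_mul_quant. Stripped of the Hypothesis harness, a minimal reproducer would look like this (module path taken from the traceback above; passing None for the scale bound mirrors the test's scale_ub_tensor = None branch):

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)

    # On SM < 8.9 this raises triton.compiler.errors.CompilationError:
    #   ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)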
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True): same CompilationError in _kernel_quantize_fp8_row via ref_fn().
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False

Same test body; with compiled=False the eager call into silu_mul_quant fails at the same kernel launch, now without the Dynamo frame:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
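The compiled=False traceback has no eval_frame hop, confirming the failure lives in the kernel launch itself rather than in anything torch.compile does. Combining ref_fn's math (verbatim from the listing above) with the torch-only quantizer sketched earlier gives a Triton-free reference for the whole op; silu_mul_quant_ref is a hypothetical name:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Same math as the test's ref_fn: SiLU(x0) * x1 in fp32, then row-wise fp8.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
        return quantize_fp8_row_torch(y, scale_ub)  # sketched after the ref_fn failures above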
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8809124Z 2025-05-07T20:32:54.8809525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8809536Z 2025-05-07T20:32:54.8809634Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8809851Z self=, 2025-05-07T20:32:54.8809932Z T=128, 2025-05-07T20:32:54.8810005Z D=5120, 2025-05-07T20:32:54.8810090Z scale_ub=None, 2025-05-07T20:32:54.8810181Z contiguous=False, 2025-05-07T20:32:54.8810261Z compiled=True, 2025-05-07T20:32:54.8810335Z ) 2025-05-07T20:32:54.8810594Z self = 2025-05-07T20:32:54.8810759Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.8810766Z 2025-05-07T20:32:54.8810846Z @given( 2025-05-07T20:32:54.8810964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8811062Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8811179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8811291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8811400Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8811478Z ) 2025-05-07T20:32:54.8811718Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8811852Z def test_silu_mul_quant( 2025-05-07T20:32:54.8811936Z self, 2025-05-07T20:32:54.8812011Z T: int, 2025-05-07T20:32:54.8812085Z D: int, 2025-05-07T20:32:54.8812189Z scale_ub: Optional[float], 2025-05-07T20:32:54.8812277Z contiguous: bool, 2025-05-07T20:32:54.8812366Z compiled: bool, 2025-05-07T20:32:54.8812443Z ) -> None: 2025-05-07T20:32:54.8812540Z torch.manual_seed(2025) 2025-05-07T20:32:54.8812619Z 2025-05-07T20:32:54.8812785Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8812857Z 2025-05-07T20:32:54.8812952Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8813073Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8813158Z x = x_sign * x_clamp 2025-05-07T20:32:54.8813244Z x0 = x[:, :D] 2025-05-07T20:32:54.8813320Z x1 = x[:, D:] 2025-05-07T20:32:54.8813393Z 2025-05-07T20:32:54.8813480Z if contiguous: 2025-05-07T20:32:54.8813573Z x0 = x0.contiguous() 2025-05-07T20:32:54.8813789Z x1 = x1.contiguous() 2025-05-07T20:32:54.8813871Z 2025-05-07T20:32:54.8813963Z if scale_ub is not None: 2025-05-07T20:32:54.8814078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8814211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8814286Z ) 2025-05-07T20:32:54.8814416Z else: 2025-05-07T20:32:54.8814508Z scale_ub_tensor = None 2025-05-07T20:32:54.8814580Z 2025-05-07T20:32:54.8814714Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8814802Z op = silu_mul_quant 2025-05-07T20:32:54.8814885Z if compiled: 2025-05-07T20:32:54.8814988Z op = torch.compile(op) 2025-05-07T20:32:54.8815091Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8815163Z 2025-05-07T20:32:54.8815259Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8815263Z 2025-05-07T20:32:54.8815359Z moe/activation_test.py:117: 2025-05-07T20:32:54.8815491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8815631Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8815731Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8816096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.8816190Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.8816673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8816779Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8817130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8817351Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8817682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8817776Z kernel = self.compile( 2025-05-07T20:32:54.8818205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8818375Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8818510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8818514Z 2025-05-07T20:32:54.8818712Z self = 2025-05-07T20:32:54.8819471Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8819975Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92ba12d40>} 2025-05-07T20:32:54.8820749Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8820942Z context = 2025-05-07T20:32:54.8820949Z 2025-05-07T20:32:54.8821108Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8821362Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8821472Z module_map=module_map) 2025-05-07T20:32:54.8821629Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8821733Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8821808Z E ^ 2025-05-07T20:32:54.8822155Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8822162Z 2025-05-07T20:32:54.8822574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8822578Z 2025-05-07T20:32:54.8822681Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8822903Z self=, 2025-05-07T20:32:54.8823022Z T=128, 2025-05-07T20:32:54.8823095Z D=7168, 2025-05-07T20:32:54.8823182Z scale_ub=1200.0, 2025-05-07T20:32:54.8823264Z contiguous=False, 2025-05-07T20:32:54.8823346Z compiled=False, 2025-05-07T20:32:54.8823424Z ) 2025-05-07T20:32:54.8823637Z self = 2025-05-07T20:32:54.8823802Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.8823806Z 2025-05-07T20:32:54.8823885Z @given( 2025-05-07T20:32:54.8824010Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8824113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8824266Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8824382Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8824497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8824571Z ) 2025-05-07T20:32:54.8824814Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8824913Z def test_silu_mul_quant( 2025-05-07T20:32:54.8824988Z self, 2025-05-07T20:32:54.8825063Z T: int, 2025-05-07T20:32:54.8825143Z D: int, 2025-05-07T20:32:54.8825239Z scale_ub: Optional[float], 2025-05-07T20:32:54.8825326Z contiguous: bool, 2025-05-07T20:32:54.8825415Z compiled: bool, 2025-05-07T20:32:54.8825489Z ) -> None: 2025-05-07T20:32:54.8825589Z torch.manual_seed(2025) 2025-05-07T20:32:54.8825659Z 2025-05-07T20:32:54.8825826Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8825905Z 2025-05-07T20:32:54.8825995Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8826163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8826258Z x = x_sign * x_clamp 2025-05-07T20:32:54.8826337Z x0 = x[:, :D] 2025-05-07T20:32:54.8826415Z x1 = x[:, D:] 2025-05-07T20:32:54.8826494Z 2025-05-07T20:32:54.8826574Z if contiguous: 2025-05-07T20:32:54.8826661Z x0 = x0.contiguous() 2025-05-07T20:32:54.8826760Z x1 = x1.contiguous() 2025-05-07T20:32:54.8826831Z 2025-05-07T20:32:54.8826921Z if scale_ub is not None: 2025-05-07T20:32:54.8827028Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8827185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8827261Z ) 2025-05-07T20:32:54.8827336Z else: 2025-05-07T20:32:54.8827477Z scale_ub_tensor = None 2025-05-07T20:32:54.8827549Z 2025-05-07T20:32:54.8827682Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8827771Z op = silu_mul_quant 2025-05-07T20:32:54.8827855Z if compiled: 2025-05-07T20:32:54.8827960Z op = torch.compile(op) 2025-05-07T20:32:54.8828063Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8828137Z 2025-05-07T20:32:54.8828235Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8828239Z 2025-05-07T20:32:54.8828332Z moe/activation_test.py:117: 2025-05-07T20:32:54.8828458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8828562Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8828658Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8829152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8829250Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8829605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:54.8829833Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8830166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8830324Z kernel = self.compile( 2025-05-07T20:32:54.8830709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8830878Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8831011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8831016Z 2025-05-07T20:32:54.8831216Z self = 2025-05-07T20:32:54.8832015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8832523Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92b34b880>} 2025-05-07T20:32:54.8833252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8833446Z context = 2025-05-07T20:32:54.8833451Z 2025-05-07T20:32:54.8833611Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8833877Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8833985Z module_map=module_map) 2025-05-07T20:32:54.8834143Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8834248Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8834367Z E ^ 2025-05-07T20:32:54.8834713Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8834719Z 2025-05-07T20:32:54.8835128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8835132Z 
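Note: every failure in this run has the same root cause. Triton's fp8e4nv dtype is FP8 E4M3 in the NVIDIA encoding (torch.float8_e4m3fn), and Triton only compiles it for GPUs of compute capability 8.9 or newer (Ada/Hopper); on older parts it offers only 'fp8e4b15' and 'fp8e5', which is exactly what the ValueError reports. A minimal sketch of a capability guard that would skip these Hypothesis cases on such hardware follows; it assumes only the standard PyTorch CUDA API, and supports_fp8e4nv is a hypothetical helper name, not a function from the FBGEMM test suite:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton maps fp8e4nv to torch.float8_e4m3fn and, on NVIDIA GPUs,
        # needs compute capability (8, 9) or newer to compile it.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the failing test, e.g.:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...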
[… four further Hypothesis examples elided; each repeats the test body and traceback shown above verbatim (the compiled=True runs add a single torch/_dynamo/eval_frame.py:678 frame) and fails with the same CompilationError in _fbgemm_silu_mul_quant:
    T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False
    T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False
    T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True
    T=1,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True …]
2025-05-07T20:32:54.8886853Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8887111Z self=, 2025-05-07T20:32:54.8887191Z T=1, 2025-05-07T20:32:54.8887264Z D=7168, 2025-05-07T20:32:54.8887342Z scale_ub=None, 2025-05-07T20:32:54.8887432Z contiguous=False, 2025-05-07T20:32:54.8887514Z compiled=True, 2025-05-07T20:32:54.8887583Z ) 2025-05-07T20:32:54.8887800Z self = 2025-05-07T20:32:54.8887957Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[… test body identical to the examples above; this time fn() itself succeeded and the failure moved to the reference path …]
2025-05-07T20:32:54.8897200Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.8897323Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.8897406Z 2025-05-07T20:32:54.8897551Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8897654Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.8897762Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.8897889Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.8898043Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.8898120Z 2025-05-07T20:32:54.8898499Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:54.8898508Z 2025-05-07T20:32:54.8898627Z moe/activation_test.py:126: 2025-05-07T20:32:54.8898759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8899030Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.8899173Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.8899721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.8899830Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.8900185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8900407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8900780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.8901100Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.8901476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.8901643Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.8901975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.8902056Z fn() 2025-05-07T20:32:54.8902451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.8902532Z self.fn.run( 2025-05-07T20:32:54.8902874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8902970Z kernel = self.compile( 2025-05-07T20:32:54.8903359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8903599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8903729Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8903738Z 2025-05-07T20:32:54.8903953Z self = 2025-05-07T20:32:54.8904718Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8905224Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92bc00180>} 2025-05-07T20:32:54.8906021Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8906212Z context = 2025-05-07T20:32:54.8906217Z 2025-05-07T20:32:54.8906387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8906647Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8906764Z module_map=module_map) 2025-05-07T20:32:54.8906924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8907026Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.8907107Z E ^ 2025-05-07T20:32:54.8907456Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8907464Z 2025-05-07T20:32:54.8907879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8907886Z 
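Note: in the example above the kernel under test ran, and it was the reference path that failed: ref_fn computes silu-mul in fp32 (y = x0 * sigmoid(x0) * x1) and then row-wise FP8 quantization via triton_quantize_fp8_row, which hits the same fp8e4nv compilation error in _kernel_quantize_fp8_row. For clarity, here is a plain-PyTorch sketch of what row-wise quantization computes, consistent with how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]). quantize_fp8_row_ref is a hypothetical name, and treating scale_ub as a cap on the per-row max is an assumption inferred from the test, not FBGEMM's documented contract:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row: the row's max |value| maps to the largest
        # finite float8_e4m3fn value (448.0).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            # Assumed semantics: cap the row max before computing the scale.
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    # Dequantization then recovers y approximately:
    #   y ≈ y_fp8.to(torch.float32) * scale[:, None]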
2025-05-07T20:32:54.8914881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8914979Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8915339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8915557Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8915933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8916030Z kernel = self.compile( 2025-05-07T20:32:54.8916404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8916573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8916704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8916709Z 2025-05-07T20:32:54.8916908Z self = 2025-05-07T20:32:54.8917720Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8918220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92bc01300>} 2025-05-07T20:32:54.8918955Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8919144Z context = 2025-05-07T20:32:54.8919148Z 2025-05-07T20:32:54.8919307Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8919569Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8919676Z module_map=module_map) 2025-05-07T20:32:54.8919874Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8919979Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8920055Z E ^ 2025-05-07T20:32:54.8920411Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8920415Z 2025-05-07T20:32:54.8920818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8920822Z 2025-05-07T20:32:54.8920921Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8921143Z self=, 2025-05-07T20:32:54.8921219Z T=1, 2025-05-07T20:32:54.8921301Z D=5120, 2025-05-07T20:32:54.8921424Z scale_ub=1200.0, 2025-05-07T20:32:54.8921509Z contiguous=False, 2025-05-07T20:32:54.8921595Z compiled=False, 2025-05-07T20:32:54.8921668Z ) 2025-05-07T20:32:54.8921886Z self = 2025-05-07T20:32:54.8922054Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.8922059Z 2025-05-07T20:32:54.8922137Z @given( 2025-05-07T20:32:54.8922255Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8922356Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8922466Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8922591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8922702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8922777Z ) 2025-05-07T20:32:54.8923026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8923120Z def test_silu_mul_quant( 2025-05-07T20:32:54.8923197Z self, 2025-05-07T20:32:54.8923279Z T: int, 2025-05-07T20:32:54.8923354Z D: int, 2025-05-07T20:32:54.8923452Z scale_ub: Optional[float], 2025-05-07T20:32:54.8923550Z contiguous: bool, 2025-05-07T20:32:54.8923633Z compiled: bool, 2025-05-07T20:32:54.8923709Z ) -> None: 2025-05-07T20:32:54.8923811Z torch.manual_seed(2025) 2025-05-07T20:32:54.8923926Z 2025-05-07T20:32:54.8924099Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8924171Z 2025-05-07T20:32:54.8924261Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8924390Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8924479Z x = x_sign * x_clamp 2025-05-07T20:32:54.8924555Z x0 = x[:, :D] 2025-05-07T20:32:54.8924641Z x1 = x[:, D:] 2025-05-07T20:32:54.8924714Z 2025-05-07T20:32:54.8924796Z if contiguous: 2025-05-07T20:32:54.8924893Z x0 = x0.contiguous() 2025-05-07T20:32:54.8924984Z x1 = x1.contiguous() 2025-05-07T20:32:54.8925054Z 2025-05-07T20:32:54.8925219Z if scale_ub is not None: 2025-05-07T20:32:54.8925324Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8925456Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8925536Z ) 2025-05-07T20:32:54.8925618Z else: 2025-05-07T20:32:54.8925715Z scale_ub_tensor = None 2025-05-07T20:32:54.8925787Z 2025-05-07T20:32:54.8925913Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8926007Z op = silu_mul_quant 2025-05-07T20:32:54.8926088Z if compiled: 2025-05-07T20:32:54.8926184Z op = torch.compile(op) 2025-05-07T20:32:54.8926292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8926360Z 2025-05-07T20:32:54.8926451Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8926455Z 2025-05-07T20:32:54.8926554Z moe/activation_test.py:117: 2025-05-07T20:32:54.8926687Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8926790Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8926931Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8927420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8927521Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8927874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8928093Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8928432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8928522Z kernel = self.compile( 2025-05-07T20:32:54.8928904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8929115Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8929247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8929251Z 2025-05-07T20:32:54.8929457Z self = 2025-05-07T20:32:54.8930218Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8930722Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92bc02020>} 2025-05-07T20:32:54.8931453Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8931643Z context = 2025-05-07T20:32:54.8931654Z 2025-05-07T20:32:54.8931818Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8932072Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8932228Z module_map=module_map) 2025-05-07T20:32:54.8932388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8932484Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8932567Z E ^ 2025-05-07T20:32:54.8932912Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8932916Z 2025-05-07T20:32:54.8933323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8933331Z 2025-05-07T20:32:54.8933428Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8933789Z self=, 2025-05-07T20:32:54.8933878Z T=16384, 2025-05-07T20:32:54.8933954Z D=5120, 2025-05-07T20:32:54.8934035Z scale_ub=1200.0, 2025-05-07T20:32:54.8934127Z contiguous=False, 2025-05-07T20:32:54.8934210Z compiled=True, 2025-05-07T20:32:54.8934281Z ) 2025-05-07T20:32:54.8934506Z self = 2025-05-07T20:32:54.8934683Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.8934687Z 2025-05-07T20:32:54.8934770Z @given( 2025-05-07T20:32:54.8934886Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8934981Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8935099Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8935216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8935328Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8935410Z ) 2025-05-07T20:32:54.8935690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8935787Z def test_silu_mul_quant( 2025-05-07T20:32:54.8935862Z self, 2025-05-07T20:32:54.8935939Z T: int, 2025-05-07T20:32:54.8936017Z D: int, 2025-05-07T20:32:54.8936112Z scale_ub: Optional[float], 2025-05-07T20:32:54.8936198Z contiguous: bool, 2025-05-07T20:32:54.8936286Z compiled: bool, 2025-05-07T20:32:54.8936366Z ) -> None: 2025-05-07T20:32:54.8936459Z torch.manual_seed(2025) 2025-05-07T20:32:54.8936539Z 2025-05-07T20:32:54.8936702Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8936773Z 2025-05-07T20:32:54.8936876Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8937040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8937127Z x = x_sign * x_clamp 2025-05-07T20:32:54.8937214Z x0 = x[:, :D] 2025-05-07T20:32:54.8937292Z x1 = x[:, D:] 2025-05-07T20:32:54.8937375Z 2025-05-07T20:32:54.8937458Z if contiguous: 2025-05-07T20:32:54.8937546Z x0 = x0.contiguous() 2025-05-07T20:32:54.8937641Z x1 = x1.contiguous() 2025-05-07T20:32:54.8937712Z 2025-05-07T20:32:54.8937800Z if scale_ub is not None: 2025-05-07T20:32:54.8937910Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8938040Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8938115Z ) 2025-05-07T20:32:54.8938199Z else: 2025-05-07T20:32:54.8938291Z scale_ub_tensor = None 2025-05-07T20:32:54.8938363Z 2025-05-07T20:32:54.8938499Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8938585Z op = silu_mul_quant 2025-05-07T20:32:54.8938676Z if compiled: 2025-05-07T20:32:54.8938773Z op = torch.compile(op) 2025-05-07T20:32:54.8938877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8938954Z 2025-05-07T20:32:54.8939046Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8939050Z 2025-05-07T20:32:54.8939144Z moe/activation_test.py:117: 2025-05-07T20:32:54.8939275Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8939425Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8939522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8939890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.8939980Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.8940469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8940562Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8940917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8941180Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8941512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8941607Z kernel = self.compile( 2025-05-07T20:32:54.8941990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8942160Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8942290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8942295Z 2025-05-07T20:32:54.8942495Z self = 2025-05-07T20:32:54.8943256Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8943800Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92bc03600>} 2025-05-07T20:32:54.8944534Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8944727Z context = 2025-05-07T20:32:54.8944732Z 2025-05-07T20:32:54.8944893Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8945153Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8945299Z module_map=module_map) 2025-05-07T20:32:54.8945457Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8945562Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8945639Z E ^ 2025-05-07T20:32:54.8945984Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8945992Z 2025-05-07T20:32:54.8946402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8946406Z 2025-05-07T20:32:54.8946507Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8946728Z self=, 2025-05-07T20:32:54.8946804Z T=2048, 2025-05-07T20:32:54.8946883Z D=7168, 2025-05-07T20:32:54.8946972Z scale_ub=1200.0, 2025-05-07T20:32:54.8947055Z contiguous=False, 2025-05-07T20:32:54.8947135Z compiled=True, 2025-05-07T20:32:54.8947218Z ) 2025-05-07T20:32:54.8947434Z self = 2025-05-07T20:32:54.8947605Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.8947622Z 2025-05-07T20:32:54.8947697Z @given( 2025-05-07T20:32:54.8947813Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8947967Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8948077Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8948193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8948311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8948385Z ) 2025-05-07T20:32:54.8948625Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8948723Z def test_silu_mul_quant( 2025-05-07T20:32:54.8948800Z self, 2025-05-07T20:32:54.8948877Z T: int, 2025-05-07T20:32:54.8948963Z D: int, 2025-05-07T20:32:54.8949060Z scale_ub: Optional[float], 2025-05-07T20:32:54.8949153Z contiguous: bool, 2025-05-07T20:32:54.8949276Z compiled: bool, 2025-05-07T20:32:54.8949354Z ) -> None: 2025-05-07T20:32:54.8949457Z torch.manual_seed(2025) 2025-05-07T20:32:54.8949528Z 2025-05-07T20:32:54.8949690Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8949774Z 2025-05-07T20:32:54.8949862Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8949983Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8950082Z x = x_sign * x_clamp 2025-05-07T20:32:54.8950160Z x0 = x[:, :D] 2025-05-07T20:32:54.8950238Z x1 = x[:, D:] 2025-05-07T20:32:54.8950316Z 2025-05-07T20:32:54.8950396Z if contiguous: 2025-05-07T20:32:54.8950489Z x0 = x0.contiguous() 2025-05-07T20:32:54.8950576Z x1 = x1.contiguous() 2025-05-07T20:32:54.8950645Z 2025-05-07T20:32:54.8950742Z if scale_ub is not None: 2025-05-07T20:32:54.8950841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8950997Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8951110Z ) 2025-05-07T20:32:54.8951191Z else: 2025-05-07T20:32:54.8951283Z scale_ub_tensor = None 2025-05-07T20:32:54.8951356Z 2025-05-07T20:32:54.8951493Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8951581Z op = silu_mul_quant 2025-05-07T20:32:54.8951661Z if compiled: 2025-05-07T20:32:54.8951769Z op = torch.compile(op) 2025-05-07T20:32:54.8951872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8951942Z 2025-05-07T20:32:54.8952039Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8952044Z 2025-05-07T20:32:54.8952142Z moe/activation_test.py:117: 2025-05-07T20:32:54.8952277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8952414Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8952511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8952883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.8952974Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.8953458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:54.8953561Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:54.8953911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.8954136Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.8954471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.8954561Z     kernel = self.compile(
2025-05-07T20:32:54.8954946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.8955119Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.8955247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.8955261Z 
2025-05-07T20:32:54.8955526Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:54.8956284Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.8956783Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function min_dot_size.<locals>.<lambda> at 0x7fe92a424720>}
2025-05-07T20:32:54.8957546Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:54.8957747Z context = <...>
2025-05-07T20:32:54.8957752Z 
2025-05-07T20:32:54.8957912Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.8958168Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.8958280Z                            module_map=module_map)
2025-05-07T20:32:54.8958438Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8958534Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.8958620Z E   ^
2025-05-07T20:32:54.8958964Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8958970Z 
2025-05-07T20:32:54.8959376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8959383Z 
2025-05-07T20:32:54.8959485Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8959737Z     self=<...>,
2025-05-07T20:32:54.8959823Z     T=1,
2025-05-07T20:32:54.8959897Z     D=5120,
2025-05-07T20:32:54.8959987Z     scale_ub=None,
2025-05-07T20:32:54.8960072Z     contiguous=False,
2025-05-07T20:32:54.8960153Z     compiled=False,
2025-05-07T20:32:54.8960230Z )
2025-05-07T20:32:54.8960447Z self = <...>
2025-05-07T20:32:54.8960606Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:54.8960611Z 
2025-05-07T20:32:54.8960690Z     @given(
2025-05-07T20:32:54.8960806Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.8960902Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.8961060Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.8961174Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.8961292Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.8961367Z     )
2025-05-07T20:32:54.8961608Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.8961703Z     def test_silu_mul_quant(
2025-05-07T20:32:54.8961779Z         self,
2025-05-07T20:32:54.8961853Z         T: int,
2025-05-07T20:32:54.8961934Z         D: int,
2025-05-07T20:32:54.8962030Z         scale_ub: Optional[float],
2025-05-07T20:32:54.8962117Z         contiguous: bool,
2025-05-07T20:32:54.8962205Z         compiled: bool,
2025-05-07T20:32:54.8962282Z     ) -> None:
2025-05-07T20:32:54.8962376Z         torch.manual_seed(2025)
2025-05-07T20:32:54.8962452Z 
2025-05-07T20:32:54.8962616Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.8962685Z 
2025-05-07T20:32:54.8962781Z         x_sign = torch.sign(x)
2025-05-07T20:32:54.8962904Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:54.8962998Z         x = x_sign * x_clamp
2025-05-07T20:32:54.8963076Z         x0 = x[:, :D]
2025-05-07T20:32:54.8963157Z         x1 = x[:, D:]
2025-05-07T20:32:54.8963236Z 
2025-05-07T20:32:54.8963319Z         if contiguous:
2025-05-07T20:32:54.8963408Z             x0 = x0.contiguous()
2025-05-07T20:32:54.8963548Z             x1 = x1.contiguous()
2025-05-07T20:32:54.8963620Z 
2025-05-07T20:32:54.8963707Z         if scale_ub is not None:
2025-05-07T20:32:54.8963818Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:54.8963949Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:54.8964020Z             )
2025-05-07T20:32:54.8964103Z         else:
2025-05-07T20:32:54.8964193Z             scale_ub_tensor = None
2025-05-07T20:32:54.8964271Z 
2025-05-07T20:32:54.8964398Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.8964488Z             op = silu_mul_quant
2025-05-07T20:32:54.8964576Z             if compiled:
2025-05-07T20:32:54.8965120Z                 op = torch.compile(op)
2025-05-07T20:32:54.8965226Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.8965306Z 
2025-05-07T20:32:54.8965395Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:54.8965400Z 
2025-05-07T20:32:54.8965499Z moe/activation_test.py:117: 
2025-05-07T20:32:54.8965631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.8965729Z moe/activation_test.py:115: in fn
2025-05-07T20:32:54.8965833Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.8966320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:54.8966417Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:54.8966776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.8966998Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.8967379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.8967470Z     kernel = self.compile(
2025-05-07T20:32:54.8967845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.8968029Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.8968153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.8968157Z 
2025-05-07T20:32:54.8968358Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:54.8969124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.8969660Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function min_dot_size.<locals>.<lambda> at 0x7fe92a425120>}
2025-05-07T20:32:54.8970391Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:54.8970577Z context = <...>
2025-05-07T20:32:54.8970582Z 
2025-05-07T20:32:54.8970752Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.8971006Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.8971111Z                            module_map=module_map)
2025-05-07T20:32:54.8971275Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8971372Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.8971450Z E   ^
2025-05-07T20:32:54.8971808Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8971812Z 
2025-05-07T20:32:54.8972217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8972261Z 
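The failure above is architecture-specific rather than input-specific: Triton's fp8e4nv type is FP8 E4M3, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) onward, and this job's linux.g5.4xlarge runner carries an A10G at compute capability 8.6. A minimal sketch of a guard a test could use to skip FP8 E4M3 cases on such devices -- supports_fp8e4nv is a hypothetical helper, not part of the FBGEMM sources:

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (FP8 E4M3) only on sm_89+ NVIDIA GPUs;
        # the A10G on this runner reports (8, 6), hence the ValueError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

A test module could then gate these cases with unittest.skipUnless(supports_fp8e4nv(), "requires sm_89+ for FP8 E4M3") or an equivalent pytest marker.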
Hypothesis then tries further parameter combinations; each one raises the identical CompilationError from the same _fbgemm_silu_mul_quant launch:

2025-05-07T20:32:54.8972369Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:54.8984825Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:54.8997729Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:54.9013011Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:54.9031190Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
The remaining examples fail the same way:

2025-05-07T20:32:54.9044736Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:54.9057877Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:54.9070862Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:54.9084178Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:54.9097060Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
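The same ValueError can be reproduced outside FBGEMM with any Triton kernel that materializes the fp8e4nv dtype on a pre-sm_89 GPU; a minimal sketch (a hypothetical kernel, not taken from this log):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _to_fp8(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On sm < 89 this cast aborts compilation with
        # "type fp8e4nv not supported in this architecture".
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _to_fp8[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)

The final example in this block takes the compiled=True path, where the identical error simply surfaces through torch.compile's eval_frame hook before reaching the same Triton compile step: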
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9110690Z 2025-05-07T20:32:54.9111101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9111105Z 2025-05-07T20:32:54.9111219Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9111437Z self=, 2025-05-07T20:32:54.9111517Z T=4096, 2025-05-07T20:32:54.9111606Z D=7168, 2025-05-07T20:32:54.9111690Z scale_ub=None, 2025-05-07T20:32:54.9111817Z contiguous=False, 2025-05-07T20:32:54.9111917Z compiled=True, 2025-05-07T20:32:54.9111995Z ) 2025-05-07T20:32:54.9112221Z self = 2025-05-07T20:32:54.9112395Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.9112402Z 2025-05-07T20:32:54.9112479Z @given( 2025-05-07T20:32:54.9112608Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9112710Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9112824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9112950Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9113065Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9113147Z ) 2025-05-07T20:32:54.9113393Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9113491Z def test_silu_mul_quant( 2025-05-07T20:32:54.9113579Z self, 2025-05-07T20:32:54.9113661Z T: int, 2025-05-07T20:32:54.9113739Z D: int, 2025-05-07T20:32:54.9113888Z scale_ub: Optional[float], 2025-05-07T20:32:54.9113979Z contiguous: bool, 2025-05-07T20:32:54.9114067Z compiled: bool, 2025-05-07T20:32:54.9114157Z ) -> None: 2025-05-07T20:32:54.9114253Z torch.manual_seed(2025) 2025-05-07T20:32:54.9114329Z 2025-05-07T20:32:54.9114503Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9114578Z 2025-05-07T20:32:54.9114672Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9114803Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9114893Z x = x_sign * x_clamp 2025-05-07T20:32:54.9114978Z x0 = x[:, :D] 2025-05-07T20:32:54.9115060Z x1 = x[:, D:] 2025-05-07T20:32:54.9115177Z 2025-05-07T20:32:54.9115268Z if contiguous: 2025-05-07T20:32:54.9115364Z x0 = x0.contiguous() 2025-05-07T20:32:54.9115457Z x1 = x1.contiguous() 2025-05-07T20:32:54.9115541Z 2025-05-07T20:32:54.9115638Z if scale_ub is not None: 2025-05-07T20:32:54.9115745Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9115886Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9115966Z ) 2025-05-07T20:32:54.9116045Z else: 2025-05-07T20:32:54.9116148Z scale_ub_tensor = None 2025-05-07T20:32:54.9116221Z 2025-05-07T20:32:54.9116358Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9116448Z op = silu_mul_quant 2025-05-07T20:32:54.9116535Z if compiled: 2025-05-07T20:32:54.9116642Z op = torch.compile(op) 2025-05-07T20:32:54.9116747Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9116821Z 2025-05-07T20:32:54.9116927Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9116932Z 2025-05-07T20:32:54.9117030Z moe/activation_test.py:117: 2025-05-07T20:32:54.9117162Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9117275Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9117376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9117750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9117892Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9118380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9118488Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9118843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9119065Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9119416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9119551Z kernel = self.compile( 2025-05-07T20:32:54.9119940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9120116Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9120251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9120255Z 2025-05-07T20:32:54.9120483Z self = 2025-05-07T20:32:54.9121275Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9121779Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a600d60>} 2025-05-07T20:32:54.9122572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9122768Z context = 2025-05-07T20:32:54.9122779Z 2025-05-07T20:32:54.9122945Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9123204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9123319Z module_map=module_map) 2025-05-07T20:32:54.9123481Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9123580Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9123670Z E ^ 2025-05-07T20:32:54.9124060Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9124067Z 2025-05-07T20:32:54.9124481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9124485Z 2025-05-07T20:32:54.9124589Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9124813Z self=, 2025-05-07T20:32:54.9124902Z T=16384, 2025-05-07T20:32:54.9124980Z D=5120, 2025-05-07T20:32:54.9125064Z scale_ub=1200.0, 2025-05-07T20:32:54.9125158Z contiguous=False, 2025-05-07T20:32:54.9125246Z compiled=False, 2025-05-07T20:32:54.9125324Z ) 2025-05-07T20:32:54.9125550Z self = 2025-05-07T20:32:54.9125730Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.9125737Z 2025-05-07T20:32:54.9125823Z @given( 2025-05-07T20:32:54.9125945Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9126046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9126172Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9126292Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9126449Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9126533Z ) 2025-05-07T20:32:54.9126777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9126880Z def test_silu_mul_quant( 2025-05-07T20:32:54.9126959Z self, 2025-05-07T20:32:54.9127042Z T: int, 2025-05-07T20:32:54.9127131Z D: int, 2025-05-07T20:32:54.9127232Z scale_ub: Optional[float], 2025-05-07T20:32:54.9127320Z contiguous: bool, 2025-05-07T20:32:54.9127410Z compiled: bool, 2025-05-07T20:32:54.9127489Z ) -> None: 2025-05-07T20:32:54.9127588Z torch.manual_seed(2025) 2025-05-07T20:32:54.9127667Z 2025-05-07T20:32:54.9127902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9127976Z 2025-05-07T20:32:54.9128078Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9128201Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9128289Z x = x_sign * x_clamp 2025-05-07T20:32:54.9128376Z x0 = x[:, :D] 2025-05-07T20:32:54.9128455Z x1 = x[:, D:] 2025-05-07T20:32:54.9128532Z 2025-05-07T20:32:54.9128616Z if contiguous: 2025-05-07T20:32:54.9128708Z x0 = x0.contiguous() 2025-05-07T20:32:54.9128802Z x1 = x1.contiguous() 2025-05-07T20:32:54.9128874Z 2025-05-07T20:32:54.9128965Z if scale_ub is not None: 2025-05-07T20:32:54.9129077Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9129211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9129292Z ) 2025-05-07T20:32:54.9129372Z else: 2025-05-07T20:32:54.9129466Z scale_ub_tensor = None 2025-05-07T20:32:54.9129538Z 2025-05-07T20:32:54.9129673Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9129805Z op = silu_mul_quant 2025-05-07T20:32:54.9129896Z if compiled: 2025-05-07T20:32:54.9129995Z op = torch.compile(op) 2025-05-07T20:32:54.9130105Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9130184Z 2025-05-07T20:32:54.9130274Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9130278Z 2025-05-07T20:32:54.9130375Z moe/activation_test.py:117: 2025-05-07T20:32:54.9130508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9130608Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9130708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9131204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:54.9131341Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9131708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9131928Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9132265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9132366Z kernel = self.compile( 2025-05-07T20:32:54.9132745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9132923Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9133050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9133054Z 2025-05-07T20:32:54.9133255Z self = 2025-05-07T20:32:54.9134130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9134625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a601c60>} 2025-05-07T20:32:54.9135405Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9135595Z context = 2025-05-07T20:32:54.9135599Z 2025-05-07T20:32:54.9135763Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9136023Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9136134Z module_map=module_map) 2025-05-07T20:32:54.9136338Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9136437Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9136513Z E ^ 2025-05-07T20:32:54.9136867Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9136874Z 2025-05-07T20:32:54.9137278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9137282Z 2025-05-07T20:32:54.9137389Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9137609Z self=, 2025-05-07T20:32:54.9137686Z T=16384, 2025-05-07T20:32:54.9137767Z D=5120, 2025-05-07T20:32:54.9137853Z scale_ub=1200.0, 2025-05-07T20:32:54.9137941Z contiguous=True, 2025-05-07T20:32:54.9138028Z compiled=True, 2025-05-07T20:32:54.9138101Z ) 2025-05-07T20:32:54.9138318Z self = 2025-05-07T20:32:54.9138535Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.9138540Z 2025-05-07T20:32:54.9138616Z @given( 2025-05-07T20:32:54.9138737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9138839Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9138954Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9139075Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9139187Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9139261Z ) 2025-05-07T20:32:54.9139511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9139605Z def test_silu_mul_quant( 2025-05-07T20:32:54.9139728Z self, 2025-05-07T20:32:54.9143231Z T: int, 2025-05-07T20:32:54.9143317Z D: int, 2025-05-07T20:32:54.9143428Z scale_ub: Optional[float], 2025-05-07T20:32:54.9143521Z contiguous: bool, 2025-05-07T20:32:54.9143609Z compiled: bool, 2025-05-07T20:32:54.9143694Z ) -> None: 2025-05-07T20:32:54.9143790Z torch.manual_seed(2025) 2025-05-07T20:32:54.9143866Z 2025-05-07T20:32:54.9144043Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9144117Z 2025-05-07T20:32:54.9144213Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9144336Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9144424Z x = x_sign * x_clamp 2025-05-07T20:32:54.9144509Z x0 = x[:, :D] 2025-05-07T20:32:54.9144590Z x1 = x[:, D:] 2025-05-07T20:32:54.9144663Z 2025-05-07T20:32:54.9144755Z if contiguous: 2025-05-07T20:32:54.9144846Z x0 = x0.contiguous() 2025-05-07T20:32:54.9144940Z x1 = x1.contiguous() 2025-05-07T20:32:54.9145017Z 2025-05-07T20:32:54.9145107Z if scale_ub is not None: 2025-05-07T20:32:54.9145216Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9145358Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9145434Z ) 2025-05-07T20:32:54.9145510Z else: 2025-05-07T20:32:54.9145674Z scale_ub_tensor = None 2025-05-07T20:32:54.9145744Z 2025-05-07T20:32:54.9145880Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9145969Z op = silu_mul_quant 2025-05-07T20:32:54.9146053Z if compiled: 2025-05-07T20:32:54.9146156Z op = torch.compile(op) 2025-05-07T20:32:54.9146263Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9146334Z 2025-05-07T20:32:54.9146429Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9146434Z 2025-05-07T20:32:54.9146530Z moe/activation_test.py:117: 2025-05-07T20:32:54.9146665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9146814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9146919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9147292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9147389Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9147875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9147975Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9148328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9148549Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9148890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9148986Z kernel = self.compile( 2025-05-07T20:32:54.9149370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9149581Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9149711Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9149718Z 2025-05-07T20:32:54.9149929Z self = 2025-05-07T20:32:54.9150691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9151196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a603380>} 2025-05-07T20:32:54.9151992Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9152186Z context = 2025-05-07T20:32:54.9152193Z 2025-05-07T20:32:54.9152355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9152612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9152724Z module_map=module_map) 2025-05-07T20:32:54.9152887Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9152986Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9153072Z E ^ 2025-05-07T20:32:54.9153423Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9153431Z 2025-05-07T20:32:54.9153844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9153849Z 2025-05-07T20:32:54.9153952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9154171Z self=, 2025-05-07T20:32:54.9154296Z T=16384, 2025-05-07T20:32:54.9154372Z D=5120, 2025-05-07T20:32:54.9154457Z scale_ub=None, 2025-05-07T20:32:54.9154551Z contiguous=False, 2025-05-07T20:32:54.9154635Z compiled=True, 2025-05-07T20:32:54.9154721Z ) 2025-05-07T20:32:54.9154936Z self = 2025-05-07T20:32:54.9155111Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.9155116Z 2025-05-07T20:32:54.9155201Z @given( 2025-05-07T20:32:54.9155319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9155423Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9155544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9155704Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9155820Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9155901Z ) 2025-05-07T20:32:54.9156143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9156244Z def test_silu_mul_quant( 2025-05-07T20:32:54.9156323Z self, 2025-05-07T20:32:54.9156401Z T: int, 2025-05-07T20:32:54.9156485Z D: int, 2025-05-07T20:32:54.9156587Z scale_ub: Optional[float], 2025-05-07T20:32:54.9156679Z contiguous: bool, 2025-05-07T20:32:54.9156771Z compiled: bool, 2025-05-07T20:32:54.9156849Z ) -> None: 2025-05-07T20:32:54.9156948Z torch.manual_seed(2025) 2025-05-07T20:32:54.9157020Z 2025-05-07T20:32:54.9157187Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9157269Z 2025-05-07T20:32:54.9157360Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9157486Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9157620Z x = x_sign * x_clamp 2025-05-07T20:32:54.9157703Z x0 = x[:, :D] 2025-05-07T20:32:54.9157782Z x1 = x[:, D:] 2025-05-07T20:32:54.9157862Z 2025-05-07T20:32:54.9157948Z if contiguous: 2025-05-07T20:32:54.9158042Z x0 = x0.contiguous() 2025-05-07T20:32:54.9158132Z x1 = x1.contiguous() 2025-05-07T20:32:54.9158205Z 2025-05-07T20:32:54.9158302Z if scale_ub is not None: 2025-05-07T20:32:54.9158409Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9158544Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9158629Z ) 2025-05-07T20:32:54.9158706Z else: 2025-05-07T20:32:54.9158799Z scale_ub_tensor = None 2025-05-07T20:32:54.9158918Z 2025-05-07T20:32:54.9159047Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9159136Z op = silu_mul_quant 2025-05-07T20:32:54.9159229Z if compiled: 2025-05-07T20:32:54.9159331Z op = torch.compile(op) 2025-05-07T20:32:54.9159438Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9159511Z 2025-05-07T20:32:54.9159604Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9159609Z 2025-05-07T20:32:54.9159710Z moe/activation_test.py:117: 2025-05-07T20:32:54.9159839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9159937Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9160042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9160444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9160551Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9161039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9161141Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9161501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9161720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9162125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9162223Z kernel = self.compile( 2025-05-07T20:32:54.9162597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9162773Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9162900Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9162911Z 2025-05-07T20:32:54.9163111Z self = 2025-05-07T20:32:54.9163916Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9164419Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a3a85e0>} 2025-05-07T20:32:54.9165158Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9165347Z context = 2025-05-07T20:32:54.9165352Z 2025-05-07T20:32:54.9165515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9165778Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9165888Z module_map=module_map) 2025-05-07T20:32:54.9166095Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9166194Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9166273Z E ^ 2025-05-07T20:32:54.9166626Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9166631Z 2025-05-07T20:32:54.9167034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9167039Z 2025-05-07T20:32:54.9167146Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9167365Z self=, 2025-05-07T20:32:54.9167441Z T=2048, 2025-05-07T20:32:54.9167563Z D=5120, 2025-05-07T20:32:54.9167646Z scale_ub=None, 2025-05-07T20:32:54.9167732Z contiguous=False, 2025-05-07T20:32:54.9167818Z compiled=True, 2025-05-07T20:32:54.9167892Z ) 2025-05-07T20:32:54.9168110Z self = 2025-05-07T20:32:54.9168285Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.9168292Z 2025-05-07T20:32:54.9168367Z @given( 2025-05-07T20:32:54.9168491Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9168586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9168698Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9168816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9168929Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9169002Z ) 2025-05-07T20:32:54.9169247Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9169342Z def test_silu_mul_quant( 2025-05-07T20:32:54.9169417Z self, 2025-05-07T20:32:54.9169500Z T: int, 2025-05-07T20:32:54.9169578Z D: int, 2025-05-07T20:32:54.9169677Z scale_ub: Optional[float], 2025-05-07T20:32:54.9169772Z contiguous: bool, 2025-05-07T20:32:54.9169857Z compiled: bool, 2025-05-07T20:32:54.9169941Z ) -> None: 2025-05-07T20:32:54.9170076Z torch.manual_seed(2025) 2025-05-07T20:32:54.9170148Z 2025-05-07T20:32:54.9170324Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9170410Z 2025-05-07T20:32:54.9170512Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9170663Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9170752Z x = x_sign * x_clamp 2025-05-07T20:32:54.9170832Z x0 = x[:, :D] 2025-05-07T20:32:54.9170914Z x1 = x[:, D:] 2025-05-07T20:32:54.9170986Z 2025-05-07T20:32:54.9171069Z if contiguous: 2025-05-07T20:32:54.9171169Z x0 = x0.contiguous() 2025-05-07T20:32:54.9171258Z x1 = x1.contiguous() 2025-05-07T20:32:54.9171334Z 2025-05-07T20:32:54.9171466Z if scale_ub is not None: 2025-05-07T20:32:54.9171573Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9171712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9171794Z ) 2025-05-07T20:32:54.9171872Z else: 2025-05-07T20:32:54.9171967Z scale_ub_tensor = None 2025-05-07T20:32:54.9172041Z 2025-05-07T20:32:54.9172170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9172263Z op = silu_mul_quant 2025-05-07T20:32:54.9172348Z if compiled: 2025-05-07T20:32:54.9172447Z op = torch.compile(op) 2025-05-07T20:32:54.9172555Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9172628Z 2025-05-07T20:32:54.9172721Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9172728Z 2025-05-07T20:32:54.9172824Z moe/activation_test.py:117: 2025-05-07T20:32:54.9172954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9173056Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9173197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9173562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9173769Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9174257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9174358Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9174710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9174929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9175312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9175405Z kernel = self.compile( 2025-05-07T20:32:54.9175790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9175969Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9176097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9176101Z 2025-05-07T20:32:54.9176310Z self = 2025-05-07T20:32:54.9177073Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9177569Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a3a9440>} 2025-05-07T20:32:54.9178311Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9178502Z context = 2025-05-07T20:32:54.9178547Z 2025-05-07T20:32:54.9178717Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9178972Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9179082Z module_map=module_map) 2025-05-07T20:32:54.9179244Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9179343Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9179425Z E ^ 2025-05-07T20:32:54.9179773Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9179780Z 2025-05-07T20:32:54.9180230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9180234Z 2025-05-07T20:32:54.9180342Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9180604Z self=, 2025-05-07T20:32:54.9180693Z T=2048, 2025-05-07T20:32:54.9180768Z D=5120, 2025-05-07T20:32:54.9180852Z scale_ub=1200.0, 2025-05-07T20:32:54.9180942Z contiguous=False, 2025-05-07T20:32:54.9181025Z compiled=True, 2025-05-07T20:32:54.9181098Z ) 2025-05-07T20:32:54.9181318Z self = 2025-05-07T20:32:54.9181488Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.9181493Z 2025-05-07T20:32:54.9181569Z @given( 2025-05-07T20:32:54.9181694Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9181791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9181911Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9182068Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9182183Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9182265Z ) 2025-05-07T20:32:54.9182512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9182606Z def test_silu_mul_quant( 2025-05-07T20:32:54.9182685Z self, 2025-05-07T20:32:54.9182762Z T: int, 2025-05-07T20:32:54.9182837Z D: int, 2025-05-07T20:32:54.9182939Z scale_ub: Optional[float], 2025-05-07T20:32:54.9183028Z contiguous: bool, 2025-05-07T20:32:54.9183113Z compiled: bool, 2025-05-07T20:32:54.9183194Z ) -> None: 2025-05-07T20:32:54.9183288Z torch.manual_seed(2025) 2025-05-07T20:32:54.9183404Z 2025-05-07T20:32:54.9183572Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9183642Z 2025-05-07T20:32:54.9183738Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9183863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9183951Z x = x_sign * x_clamp 2025-05-07T20:32:54.9184034Z x0 = x[:, :D] 2025-05-07T20:32:54.9184120Z x1 = x[:, D:] 2025-05-07T20:32:54.9184193Z 2025-05-07T20:32:54.9184280Z if contiguous: 2025-05-07T20:32:54.9184370Z x0 = x0.contiguous() 2025-05-07T20:32:54.9184457Z x1 = x1.contiguous() 2025-05-07T20:32:54.9184534Z 2025-05-07T20:32:54.9184624Z if scale_ub is not None: 2025-05-07T20:32:54.9184728Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9184861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9184935Z ) 2025-05-07T20:32:54.9185013Z else: 2025-05-07T20:32:54.9185108Z scale_ub_tensor = None 2025-05-07T20:32:54.9185181Z 2025-05-07T20:32:54.9185315Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9185405Z op = silu_mul_quant 2025-05-07T20:32:54.9185492Z if compiled: 2025-05-07T20:32:54.9185594Z op = torch.compile(op) 2025-05-07T20:32:54.9185698Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9185815Z 2025-05-07T20:32:54.9185908Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9185913Z 2025-05-07T20:32:54.9186007Z moe/activation_test.py:117: 2025-05-07T20:32:54.9186138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9186239Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9186335Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9186700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9186796Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9187320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9187423Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9187775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9188002Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9188336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9188429Z kernel = self.compile( 2025-05-07T20:32:54.9188815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9188986Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9189113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9189124Z 2025-05-07T20:32:54.9189331Z self = 2025-05-07T20:32:54.9190132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9190633Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a3aa660>} 2025-05-07T20:32:54.9191363Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9191556Z context = 2025-05-07T20:32:54.9191621Z 2025-05-07T20:32:54.9191784Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9192046Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9192158Z module_map=module_map) 2025-05-07T20:32:54.9192316Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9192421Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9192497Z E ^ 2025-05-07T20:32:54.9192845Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9192850Z 2025-05-07T20:32:54.9193259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9193263Z 2025-05-07T20:32:54.9193365Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9193583Z self=, 2025-05-07T20:32:54.9193664Z T=4096, 2025-05-07T20:32:54.9193739Z D=5120, 2025-05-07T20:32:54.9193825Z scale_ub=1200.0, 2025-05-07T20:32:54.9193910Z contiguous=True, 2025-05-07T20:32:54.9193992Z compiled=True, 2025-05-07T20:32:54.9194072Z ) 2025-05-07T20:32:54.9194289Z self = 2025-05-07T20:32:54.9194502Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.9194507Z 2025-05-07T20:32:54.9194586Z @given( 2025-05-07T20:32:54.9194703Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9194801Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9194917Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9195033Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9195148Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9195221Z ) 2025-05-07T20:32:54.9195463Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9195563Z def test_silu_mul_quant( 2025-05-07T20:32:54.9195640Z self, 2025-05-07T20:32:54.9195758Z T: int, 2025-05-07T20:32:54.9195844Z D: int, 2025-05-07T20:32:54.9195942Z scale_ub: Optional[float], 2025-05-07T20:32:54.9196031Z contiguous: bool, 2025-05-07T20:32:54.9196125Z compiled: bool, 2025-05-07T20:32:54.9196203Z ) -> None: 2025-05-07T20:32:54.9196297Z torch.manual_seed(2025) 2025-05-07T20:32:54.9196372Z 2025-05-07T20:32:54.9196538Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9196613Z 2025-05-07T20:32:54.9196705Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9196827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9196918Z x = x_sign * x_clamp 2025-05-07T20:32:54.9196998Z x0 = x[:, :D] 2025-05-07T20:32:54.9197077Z x1 = x[:, D:] 2025-05-07T20:32:54.9197153Z 2025-05-07T20:32:54.9197236Z if contiguous: 2025-05-07T20:32:54.9197325Z x0 = x0.contiguous() 2025-05-07T20:32:54.9197418Z x1 = x1.contiguous() 2025-05-07T20:32:54.9197490Z 2025-05-07T20:32:54.9197620Z if scale_ub is not None: 2025-05-07T20:32:54.9197730Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9197864Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9197947Z ) 2025-05-07T20:32:54.9198022Z else: 2025-05-07T20:32:54.9198119Z scale_ub_tensor = None 2025-05-07T20:32:54.9198507Z 2025-05-07T20:32:54.9198691Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9198785Z op = silu_mul_quant 2025-05-07T20:32:54.9198873Z if compiled: 2025-05-07T20:32:54.9198972Z op = torch.compile(op) 2025-05-07T20:32:54.9199076Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9199153Z 2025-05-07T20:32:54.9199343Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9199348Z 2025-05-07T20:32:54.9199442Z moe/activation_test.py:117: 2025-05-07T20:32:54.9199584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9199684Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9199787Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9200154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9200246Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9200736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9200833Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9201187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9201406Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9201755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9201847Z kernel = self.compile( 2025-05-07T20:32:54.9202232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9202476Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9202603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9202607Z 2025-05-07T20:32:54.9202813Z self = 2025-05-07T20:32:54.9203575Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9204136Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a3ab9c0>} 2025-05-07T20:32:54.9204872Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9205065Z context = 2025-05-07T20:32:54.9205069Z 2025-05-07T20:32:54.9205239Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9205498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9205613Z module_map=module_map) 2025-05-07T20:32:54.9205770Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9205867Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9205950Z E ^ 2025-05-07T20:32:54.9206298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9206302Z 2025-05-07T20:32:54.9206770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9206774Z 2025-05-07T20:32:54.9206881Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9207102Z self=, 2025-05-07T20:32:54.9207182Z T=128, 2025-05-07T20:32:54.9207258Z D=5120, 2025-05-07T20:32:54.9207340Z scale_ub=1200.0, 2025-05-07T20:32:54.9207427Z contiguous=False, 2025-05-07T20:32:54.9207511Z compiled=True, 2025-05-07T20:32:54.9207586Z ) 2025-05-07T20:32:54.9207804Z self = 2025-05-07T20:32:54.9207970Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.9208016Z 2025-05-07T20:32:54.9208095Z @given( 2025-05-07T20:32:54.9208212Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9208311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9208432Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9208547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9208661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9208739Z ) 2025-05-07T20:32:54.9208981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9209074Z def test_silu_mul_quant( 2025-05-07T20:32:54.9209154Z self, 2025-05-07T20:32:54.9209230Z T: int, 2025-05-07T20:32:54.9209309Z D: int, 2025-05-07T20:32:54.9209406Z scale_ub: Optional[float], 2025-05-07T20:32:54.9209494Z contiguous: bool, 2025-05-07T20:32:54.9209582Z compiled: bool, 2025-05-07T20:32:54.9209660Z ) -> None: 2025-05-07T20:32:54.9209757Z torch.manual_seed(2025) 2025-05-07T20:32:54.9209834Z 2025-05-07T20:32:54.9210000Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9210072Z 2025-05-07T20:32:54.9210172Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9210294Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9210429Z x = x_sign * x_clamp 2025-05-07T20:32:54.9210530Z x0 = x[:, :D] 2025-05-07T20:32:54.9210619Z x1 = x[:, D:] 2025-05-07T20:32:54.9210703Z 2025-05-07T20:32:54.9210801Z if contiguous: 2025-05-07T20:32:54.9210892Z x0 = x0.contiguous() 2025-05-07T20:32:54.9210986Z x1 = x1.contiguous() 2025-05-07T20:32:54.9211056Z 2025-05-07T20:32:54.9211146Z if scale_ub is not None: 2025-05-07T20:32:54.9211253Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9211387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9211465Z ) 2025-05-07T20:32:54.9211544Z else: 2025-05-07T20:32:54.9211638Z scale_ub_tensor = None 2025-05-07T20:32:54.9211709Z 2025-05-07T20:32:54.9211883Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9211977Z op = silu_mul_quant 2025-05-07T20:32:54.9212059Z if compiled: 2025-05-07T20:32:54.9212162Z op = torch.compile(op) 2025-05-07T20:32:54.9212270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9212345Z 2025-05-07T20:32:54.9212438Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9212443Z 2025-05-07T20:32:54.9212539Z moe/activation_test.py:117: 2025-05-07T20:32:54.9212672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9212770Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9212870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9213234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9213327Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9213933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9214031Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9214386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9214616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9214953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9215044Z kernel = self.compile( 2025-05-07T20:32:54.9215424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9215596Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9215765Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9215770Z 2025-05-07T20:32:54.9215976Z self = 2025-05-07T20:32:54.9216743Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9217244Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a548fe0>} 2025-05-07T20:32:54.9217975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9218168Z context = 2025-05-07T20:32:54.9218176Z 2025-05-07T20:32:54.9218336Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9218602Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9218710Z module_map=module_map) 2025-05-07T20:32:54.9218912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9219014Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9219093Z E ^ 2025-05-07T20:32:54.9219442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9219447Z 2025-05-07T20:32:54.9219858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9219863Z 2025-05-07T20:32:54.9219965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9220189Z self=, 2025-05-07T20:32:54.9220265Z T=16384, 2025-05-07T20:32:54.9220342Z D=7168, 2025-05-07T20:32:54.9220466Z scale_ub=1200.0, 2025-05-07T20:32:54.9220553Z contiguous=True, 2025-05-07T20:32:54.9220636Z compiled=True, 2025-05-07T20:32:54.9220712Z ) 2025-05-07T20:32:54.9220927Z self = 2025-05-07T20:32:54.9221100Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.9221108Z 2025-05-07T20:32:54.9221184Z @given( 2025-05-07T20:32:54.9221301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9221404Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9221520Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9221635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9221753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9221828Z ) 2025-05-07T20:32:54.9222069Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9222167Z def test_silu_mul_quant( 2025-05-07T20:32:54.9222241Z self, 2025-05-07T20:32:54.9222384Z T: int, 2025-05-07T20:32:54.9222465Z D: int, 2025-05-07T20:32:54.9222562Z scale_ub: Optional[float], 2025-05-07T20:32:54.9222656Z contiguous: bool, 2025-05-07T20:32:54.9222741Z compiled: bool, 2025-05-07T20:32:54.9222817Z ) -> None: 2025-05-07T20:32:54.9222913Z torch.manual_seed(2025) 2025-05-07T20:32:54.9222984Z 2025-05-07T20:32:54.9223148Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9223224Z 2025-05-07T20:32:54.9223312Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9223434Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9223522Z x = x_sign * x_clamp 2025-05-07T20:32:54.9223602Z x0 = x[:, :D] 2025-05-07T20:32:54.9223723Z x1 = x[:, D:] 2025-05-07T20:32:54.9223798Z 2025-05-07T20:32:54.9223879Z if contiguous: 2025-05-07T20:32:54.9223973Z x0 = x0.contiguous() 2025-05-07T20:32:54.9224066Z x1 = x1.contiguous() 2025-05-07T20:32:54.9224137Z 2025-05-07T20:32:54.9224230Z if scale_ub is not None: 2025-05-07T20:32:54.9224334Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9224469Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9224547Z ) 2025-05-07T20:32:54.9224621Z else: 2025-05-07T20:32:54.9224714Z scale_ub_tensor = None 2025-05-07T20:32:54.9224790Z 2025-05-07T20:32:54.9224919Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9225005Z op = silu_mul_quant 2025-05-07T20:32:54.9225094Z if compiled: 2025-05-07T20:32:54.9225192Z op = torch.compile(op) 2025-05-07T20:32:54.9225297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9225376Z 2025-05-07T20:32:54.9225468Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9225472Z 2025-05-07T20:32:54.9225576Z moe/activation_test.py:117: 2025-05-07T20:32:54.9225708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9225807Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9225958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9226321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9226414Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9226906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9227002Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9227358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9227580Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9227956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9228053Z kernel = self.compile( 2025-05-07T20:32:54.9228430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9228603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9228737Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9228741Z 2025-05-07T20:32:54.9228940Z self = 2025-05-07T20:32:54.9229701Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9230240Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a549e40>} 2025-05-07T20:32:54.9230977Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9231166Z context = 2025-05-07T20:32:54.9231171Z 2025-05-07T20:32:54.9231333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9231594Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9231702Z module_map=module_map) 2025-05-07T20:32:54.9231864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9232000Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9232077Z E ^ 2025-05-07T20:32:54.9232432Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9232439Z 2025-05-07T20:32:54.9232843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9232850Z 2025-05-07T20:32:54.9232957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9233173Z self=, 2025-05-07T20:32:54.9233249Z T=16384, 2025-05-07T20:32:54.9233326Z D=5120, 2025-05-07T20:32:54.9233407Z scale_ub=1200.0, 2025-05-07T20:32:54.9233490Z contiguous=True, 2025-05-07T20:32:54.9233576Z compiled=False, 2025-05-07T20:32:54.9233648Z ) 2025-05-07T20:32:54.9233863Z self = 2025-05-07T20:32:54.9234047Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.9234052Z 2025-05-07T20:32:54.9234126Z @given( 2025-05-07T20:32:54.9234245Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9234349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9234463Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9234622Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9234736Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9234807Z ) 2025-05-07T20:32:54.9235055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9235147Z def test_silu_mul_quant( 2025-05-07T20:32:54.9235221Z self, 2025-05-07T20:32:54.9235299Z T: int, 2025-05-07T20:32:54.9235374Z D: int, 2025-05-07T20:32:54.9235472Z scale_ub: Optional[float], 2025-05-07T20:32:54.9235563Z contiguous: bool, 2025-05-07T20:32:54.9235651Z compiled: bool, 2025-05-07T20:32:54.9235732Z ) -> None: 2025-05-07T20:32:54.9235825Z torch.manual_seed(2025) 2025-05-07T20:32:54.9235897Z 2025-05-07T20:32:54.9236110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9236183Z 2025-05-07T20:32:54.9236274Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9236401Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9236491Z x = x_sign * x_clamp 2025-05-07T20:32:54.9236569Z x0 = x[:, :D] 2025-05-07T20:32:54.9236654Z x1 = x[:, D:] 2025-05-07T20:32:54.9236725Z 2025-05-07T20:32:54.9236808Z if contiguous: 2025-05-07T20:32:54.9236901Z x0 = x0.contiguous() 2025-05-07T20:32:54.9236988Z x1 = x1.contiguous() 2025-05-07T20:32:54.9237059Z 2025-05-07T20:32:54.9237150Z if scale_ub is not None: 2025-05-07T20:32:54.9237254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9237391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9237469Z ) 2025-05-07T20:32:54.9237544Z else: 2025-05-07T20:32:54.9237643Z scale_ub_tensor = None 2025-05-07T20:32:54.9237713Z 2025-05-07T20:32:54.9237883Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9237980Z op = silu_mul_quant 2025-05-07T20:32:54.9238066Z if compiled: 2025-05-07T20:32:54.9238164Z op = torch.compile(op) 2025-05-07T20:32:54.9238273Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9238348Z 2025-05-07T20:32:54.9238439Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9238446Z 2025-05-07T20:32:54.9238543Z moe/activation_test.py:117: 2025-05-07T20:32:54.9238671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9238774Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9238873Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9239402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
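[editor's note] The recurring failure is Triton refusing to compile the _fbgemm_silu_mul_quant kernel because it uses the fp8e4nv (float8 e4m3) element type, which this GPU's architecture does not implement; only fp8e4b15 and fp8e5 are available here. A minimal sketch of a skip guard for such machines — the (8, 9) compute-capability threshold is an assumption about where hardware fp8e4nv support begins, not something this log states:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv corresponds to hardware fp8 support,
        # first present at compute capability 8.9 (Ada) / 9.0 (Hopper).
        # The log only shows that this particular GPU rejects the type.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical usage on the test method:
    # @unittest.skipUnless(supports_fp8e4nv(), "GPU lacks fp8e4nv support")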
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)

Each of these fails at moe/activation_test.py:117 with the identical traceback and error as above; the only difference is that with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching silu_mul_quant:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
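[editor's note] Hypothesis's verbose output lists every drawn example. To replay one of these parameter sets deterministically while debugging, it can be pinned with Hypothesis's @example decorator — a sketch on a standalone version of the test (self dropped, body elided), not the original FBGEMM test file:

    from typing import Optional

    from hypothesis import example, given, settings, strategies as st


    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    @settings(deadline=None)
    def test_silu_mul_quant(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # body as in the original test

The pinned example always runs in addition to the randomly drawn ones, so a fix can be verified against the exact failing case.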
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

The next three examples hit the same OutOfMemoryError at different allocation sites:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 28.44 MiB free
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 140.44 MiB free
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 28.44 MiB free
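[editor's note] The requested allocation sizes line up exactly with the test's shapes: x is [T, 2 * D] in bfloat16 (2 bytes per element), and each elementwise temporary (torch.randn, torch.sign, the abs/clamp result) needs one more tensor of that size. A quick check of the arithmetic (helper name is illustrative):

    def tensor_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
        # Size of one [T, 2 * D] bfloat16 tensor in MiB (bfloat16 = 2 bytes).
        return T * (2 * D) * bytes_per_elem / (1024 ** 2)

    # The sizes PyTorch tried (and failed) to allocate above:
    assert tensor_mib(16384, 5120) == 320.0  # x_clamp for T=16384, D=5120
    assert tensor_mib(4096, 7168) == 112.0   # x_clamp for T=4096, D=7168
    assert tensor_mib(16384, 7168) == 448.0  # torch.randn for T=16384, D=7168
    assert tensor_mib(2048, 7168) == 56.0    # x_clamp for T=2048, D=7168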
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError
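[editor's note] Each OutOfMemoryError message suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation. Since the failures here accumulate as Hypothesis runs example after example in one process, explicitly releasing cached blocks between examples is another plausible mitigation. A sketch — the environment variable and empty_cache are standard PyTorch usage, while the helper and its placement are hypothetical:

    import os

    # The allocator reads PYTORCH_CUDA_ALLOC_CONF at first CUDA use,
    # so it must be set before any tensor lands on the GPU.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import gc

    import torch


    def release_cached_memory() -> None:
        # Drop unreferenced tensors, then return cached blocks to the
        # driver; something a test could call between Hypothesis examples.
        gc.collect()
        torch.cuda.empty_cache()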
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)

All three reach the kernel launch at moe/activation_test.py:117 and fail with the same CompilationError as above:

E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
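[editor's note] For context on what keeps failing to compile: judging from the name and the test's return signature (y_fp8, y_scale), silu_mul_quant fuses a SiLU-gated multiply with fp8 quantization. A hedged eager-mode sketch of just the activation part, inferred from the name rather than from FBGEMM source:

    import torch
    import torch.nn.functional as F


    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Assumed semantics of the fused kernel's activation: SiLU(x0) * x1.
        # The fp8 quantization step (which produces y_fp8 and y_scale, and
        # which is what needs fp8e4nv) is deliberately left out.
        return F.silu(x0) * x1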
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9364081Z 2025-05-07T20:32:54.9364485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9364492Z 2025-05-07T20:32:54.9364591Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9364811Z self=, 2025-05-07T20:32:54.9364887Z T=2048, 2025-05-07T20:32:54.9364962Z D=7168, 2025-05-07T20:32:54.9365047Z scale_ub=1200.0, 2025-05-07T20:32:54.9365127Z contiguous=True, 2025-05-07T20:32:54.9365256Z compiled=False, 2025-05-07T20:32:54.9365326Z ) 2025-05-07T20:32:54.9365535Z self = 2025-05-07T20:32:54.9365704Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.9365709Z 2025-05-07T20:32:54.9365782Z @given( 2025-05-07T20:32:54.9365895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9365993Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9366105Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9366218Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9366333Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9366406Z ) 2025-05-07T20:32:54.9366687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9366778Z def test_silu_mul_quant( 2025-05-07T20:32:54.9366852Z self, 2025-05-07T20:32:54.9366933Z T: int, 2025-05-07T20:32:54.9367006Z D: int, 2025-05-07T20:32:54.9367100Z scale_ub: Optional[float], 2025-05-07T20:32:54.9367189Z contiguous: bool, 2025-05-07T20:32:54.9367272Z compiled: bool, 2025-05-07T20:32:54.9367348Z ) -> None: 2025-05-07T20:32:54.9367442Z torch.manual_seed(2025) 2025-05-07T20:32:54.9367511Z 2025-05-07T20:32:54.9367673Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9369449Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9369461Z 2025-05-07T20:32:54.9369577Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9369581Z 2025-05-07T20:32:54.9369680Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9369895Z self=, 2025-05-07T20:32:54.9369977Z T=1, 2025-05-07T20:32:54.9370050Z D=5120, 2025-05-07T20:32:54.9370129Z scale_ub=1200.0, 2025-05-07T20:32:54.9370214Z contiguous=True, 2025-05-07T20:32:54.9370294Z compiled=False, 2025-05-07T20:32:54.9370412Z ) 2025-05-07T20:32:54.9370664Z self = 2025-05-07T20:32:54.9370830Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.9370835Z 2025-05-07T20:32:54.9370914Z @given( 2025-05-07T20:32:54.9371028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9371125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9371240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9371351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9371459Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9371534Z ) 2025-05-07T20:32:54.9371770Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9371859Z def test_silu_mul_quant( 2025-05-07T20:32:54.9371935Z self, 2025-05-07T20:32:54.9372009Z T: int, 2025-05-07T20:32:54.9372082Z D: int, 2025-05-07T20:32:54.9372182Z scale_ub: Optional[float], 2025-05-07T20:32:54.9372270Z contiguous: bool, 2025-05-07T20:32:54.9372358Z compiled: bool, 2025-05-07T20:32:54.9372435Z ) -> None: 2025-05-07T20:32:54.9372529Z torch.manual_seed(2025) 2025-05-07T20:32:54.9372601Z 2025-05-07T20:32:54.9372762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9372877Z 2025-05-07T20:32:54.9372970Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9373091Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9373177Z x = x_sign * x_clamp 2025-05-07T20:32:54.9373257Z x0 = x[:, :D] 2025-05-07T20:32:54.9373334Z x1 = x[:, D:] 2025-05-07T20:32:54.9373403Z 2025-05-07T20:32:54.9373485Z if contiguous: 2025-05-07T20:32:54.9373572Z x0 = x0.contiguous() 2025-05-07T20:32:54.9373722Z x1 = x1.contiguous() 2025-05-07T20:32:54.9373793Z 2025-05-07T20:32:54.9373880Z if scale_ub is not None: 2025-05-07T20:32:54.9373988Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9374191Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9374265Z ) 2025-05-07T20:32:54.9374347Z else: 2025-05-07T20:32:54.9374437Z scale_ub_tensor = None 2025-05-07T20:32:54.9374507Z 2025-05-07T20:32:54.9374637Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9374728Z op = silu_mul_quant 2025-05-07T20:32:54.9374809Z if compiled: 2025-05-07T20:32:54.9374910Z op = torch.compile(op) 2025-05-07T20:32:54.9375012Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9375086Z 2025-05-07T20:32:54.9375174Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9375178Z 2025-05-07T20:32:54.9375271Z moe/activation_test.py:117: 2025-05-07T20:32:54.9375400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9375501Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9375598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9376133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9376228Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9376583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9376805Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9377138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9377231Z kernel = self.compile( 2025-05-07T20:32:54.9377607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9377779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9377950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9377954Z 2025-05-07T20:32:54.9378155Z self = 2025-05-07T20:32:54.9378920Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9379417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe5b9b61580>} 2025-05-07T20:32:54.9380150Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9380339Z context = 2025-05-07T20:32:54.9380346Z 2025-05-07T20:32:54.9380511Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9380770Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9380873Z module_map=module_map) 2025-05-07T20:32:54.9381073Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9381170Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9381243Z E ^ 2025-05-07T20:32:54.9381591Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9381596Z 2025-05-07T20:32:54.9381999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9382003Z 2025-05-07T20:32:54.9382102Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9382324Z self=, 2025-05-07T20:32:54.9382399Z T=2048, 2025-05-07T20:32:54.9382475Z D=5120, 2025-05-07T20:32:54.9382592Z scale_ub=None, 2025-05-07T20:32:54.9382678Z contiguous=True, 2025-05-07T20:32:54.9382763Z compiled=False, 2025-05-07T20:32:54.9382833Z ) 2025-05-07T20:32:54.9383044Z self = 2025-05-07T20:32:54.9383215Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.9383219Z 2025-05-07T20:32:54.9383292Z @given( 2025-05-07T20:32:54.9383407Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9386402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9386534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9386656Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9386765Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9386844Z ) 2025-05-07T20:32:54.9387089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9387187Z def test_silu_mul_quant( 2025-05-07T20:32:54.9387266Z self, 2025-05-07T20:32:54.9387405Z T: int, 2025-05-07T20:32:54.9387481Z D: int, 2025-05-07T20:32:54.9387581Z scale_ub: Optional[float], 2025-05-07T20:32:54.9387674Z contiguous: bool, 2025-05-07T20:32:54.9387756Z compiled: bool, 2025-05-07T20:32:54.9387834Z ) -> None: 2025-05-07T20:32:54.9387926Z torch.manual_seed(2025) 2025-05-07T20:32:54.9387996Z 2025-05-07T20:32:54.9388162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9388233Z 2025-05-07T20:32:54.9388321Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.9390076Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
2025-05-07T20:32:54.9382003Z 
2025-05-07T20:32:54.9382102Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.9382324Z     self=,
2025-05-07T20:32:54.9382399Z     T=2048,
2025-05-07T20:32:54.9382475Z     D=5120,
2025-05-07T20:32:54.9382592Z     scale_ub=None,
2025-05-07T20:32:54.9382678Z     contiguous=True,
2025-05-07T20:32:54.9382763Z     compiled=False,
2025-05-07T20:32:54.9382833Z )
2025-05-07T20:32:54.9383044Z self = 
2025-05-07T20:32:54.9383215Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:54.9383219Z 
2025-05-07T20:32:54.9383292Z @given(
2025-05-07T20:32:54.9383407Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.9386402Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.9386534Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.9386656Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.9386765Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.9386844Z )
2025-05-07T20:32:54.9387089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.9387187Z def test_silu_mul_quant(
2025-05-07T20:32:54.9387266Z     self,
2025-05-07T20:32:54.9387405Z     T: int,
2025-05-07T20:32:54.9387481Z     D: int,
2025-05-07T20:32:54.9387581Z     scale_ub: Optional[float],
2025-05-07T20:32:54.9387674Z     contiguous: bool,
2025-05-07T20:32:54.9387756Z     compiled: bool,
2025-05-07T20:32:54.9387834Z ) -> None:
2025-05-07T20:32:54.9387926Z     torch.manual_seed(2025)
2025-05-07T20:32:54.9387996Z 
2025-05-07T20:32:54.9388162Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.9388233Z 
2025-05-07T20:32:54.9388321Z >   x_sign = torch.sign(x)
2025-05-07T20:32:54.9390076Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.9390127Z 
2025-05-07T20:32:54.9390243Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:54.9390251Z 
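[NOTE] From here on, nearly every Hypothesis example dies at its first CUDA allocation: the process is already holding ~22 GiB of a 22.07 GiB device, so even a 40 MiB request fails. The error text itself points at the allocator knob; a sketch of applying it, together with an explicit cleanup a harness could run between examples (both are standard PyTorch/CPython APIs, but whether they would fully recover this job's memory is an assumption):

import os

# Must be set before the process first initializes CUDA.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import gc

import torch

def release_cuda_memory() -> None:
    # Drop dead Python references, then return cached blocks to the driver.
    gc.collect()
    torch.cuda.empty_cache()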
2025-05-07T20:32:54.9390349Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.9390564Z     self=,
2025-05-07T20:32:54.9390650Z     T=16384,
2025-05-07T20:32:54.9390723Z     D=5120,
2025-05-07T20:32:54.9390801Z     scale_ub=None,
2025-05-07T20:32:54.9390885Z     contiguous=True,
2025-05-07T20:32:54.9390965Z     compiled=False,
2025-05-07T20:32:54.9391036Z )
2025-05-07T20:32:54.9391248Z self = 
2025-05-07T20:32:54.9391424Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:54.9391429Z 
2025-05-07T20:32:54.9391502Z @given(
2025-05-07T20:32:54.9391621Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.9391718Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.9391876Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.9391989Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.9392097Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.9392171Z )
2025-05-07T20:32:54.9392409Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.9392500Z def test_silu_mul_quant(
2025-05-07T20:32:54.9392578Z     self,
2025-05-07T20:32:54.9392652Z     T: int,
2025-05-07T20:32:54.9392726Z     D: int,
2025-05-07T20:32:54.9392822Z     scale_ub: Optional[float],
2025-05-07T20:32:54.9392912Z     contiguous: bool,
2025-05-07T20:32:54.9392996Z     compiled: bool,
2025-05-07T20:32:54.9393071Z ) -> None:
2025-05-07T20:32:54.9393202Z     torch.manual_seed(2025)
2025-05-07T20:32:54.9393279Z 
2025-05-07T20:32:54.9393438Z >   x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.9395164Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.9395177Z 
2025-05-07T20:32:54.9395295Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:54.9395299Z 
2025-05-07T20:32:54.9395396Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.9395656Z     self=,
2025-05-07T20:32:54.9395731Z     T=4096,
2025-05-07T20:32:54.9395804Z     D=5120,
2025-05-07T20:32:54.9395891Z     scale_ub=None,
2025-05-07T20:32:54.9395975Z     contiguous=True,
2025-05-07T20:32:54.9396055Z     compiled=False,
2025-05-07T20:32:54.9396131Z )
2025-05-07T20:32:54.9396341Z self = 
2025-05-07T20:32:54.9396511Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:54.9396515Z 
2025-05-07T20:32:54.9396590Z @given(
2025-05-07T20:32:54.9396704Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.9396803Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.9396914Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.9397072Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.9397187Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.9397261Z )
2025-05-07T20:32:54.9397507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.9397597Z def test_silu_mul_quant(
2025-05-07T20:32:54.9397671Z     self,
2025-05-07T20:32:54.9397754Z     T: int,
2025-05-07T20:32:54.9397828Z     D: int,
2025-05-07T20:32:54.9397922Z     scale_ub: Optional[float],
2025-05-07T20:32:54.9398010Z     contiguous: bool,
2025-05-07T20:32:54.9398090Z     compiled: bool,
2025-05-07T20:32:54.9398572Z ) -> None:
2025-05-07T20:32:54.9398682Z     torch.manual_seed(2025)
2025-05-07T20:32:54.9398753Z 
2025-05-07T20:32:54.9398935Z >   x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.9401201Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9401305Z 2025-05-07T20:32:54.9401419Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9401428Z 2025-05-07T20:32:54.9401529Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9401743Z self=, 2025-05-07T20:32:54.9401824Z T=2048, 2025-05-07T20:32:54.9401899Z D=5120, 2025-05-07T20:32:54.9401978Z scale_ub=None, 2025-05-07T20:32:54.9402064Z contiguous=False, 2025-05-07T20:32:54.9402145Z compiled=False, 2025-05-07T20:32:54.9402223Z ) 2025-05-07T20:32:54.9402433Z self = 2025-05-07T20:32:54.9402659Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.9402667Z 2025-05-07T20:32:54.9402744Z @given( 2025-05-07T20:32:54.9402859Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9402957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9403071Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9403182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9403293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9403364Z ) 2025-05-07T20:32:54.9403600Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9403693Z def test_silu_mul_quant( 2025-05-07T20:32:54.9403766Z self, 2025-05-07T20:32:54.9403846Z T: int, 2025-05-07T20:32:54.9403924Z D: int, 2025-05-07T20:32:54.9404019Z scale_ub: Optional[float], 2025-05-07T20:32:54.9404103Z contiguous: bool, 2025-05-07T20:32:54.9404190Z compiled: bool, 2025-05-07T20:32:54.9404267Z ) -> None: 2025-05-07T20:32:54.9404422Z torch.manual_seed(2025) 2025-05-07T20:32:54.9404496Z 2025-05-07T20:32:54.9404656Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9406377Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9406438Z 2025-05-07T20:32:54.9406551Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9406555Z 2025-05-07T20:32:54.9406658Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9406875Z self=, 2025-05-07T20:32:54.9406949Z T=4096, 2025-05-07T20:32:54.9407026Z D=7168, 2025-05-07T20:32:54.9407107Z scale_ub=None, 2025-05-07T20:32:54.9407189Z contiguous=True, 2025-05-07T20:32:54.9407271Z compiled=True, 2025-05-07T20:32:54.9407340Z ) 2025-05-07T20:32:54.9407549Z self = 2025-05-07T20:32:54.9407713Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.9407717Z 2025-05-07T20:32:54.9407790Z @given( 2025-05-07T20:32:54.9407907Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9408000Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9408115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9408229Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9408340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9408415Z ) 2025-05-07T20:32:54.9408654Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9408814Z def test_silu_mul_quant( 2025-05-07T20:32:54.9408889Z self, 2025-05-07T20:32:54.9408967Z T: int, 2025-05-07T20:32:54.9409040Z D: int, 2025-05-07T20:32:54.9409136Z scale_ub: Optional[float], 2025-05-07T20:32:54.9409222Z contiguous: bool, 2025-05-07T20:32:54.9409304Z compiled: bool, 2025-05-07T20:32:54.9409381Z ) -> None: 2025-05-07T20:32:54.9409474Z torch.manual_seed(2025) 2025-05-07T20:32:54.9409543Z 2025-05-07T20:32:54.9409710Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9411472Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9411485Z 2025-05-07T20:32:54.9411601Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9411605Z 2025-05-07T20:32:54.9411705Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9411920Z self=, 2025-05-07T20:32:54.9411998Z T=2048, 2025-05-07T20:32:54.9412071Z D=5120, 2025-05-07T20:32:54.9412154Z scale_ub=1200.0, 2025-05-07T20:32:54.9412241Z contiguous=False, 2025-05-07T20:32:54.9412322Z compiled=False, 2025-05-07T20:32:54.9412395Z ) 2025-05-07T20:32:54.9412608Z self = 2025-05-07T20:32:54.9412814Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.9412818Z 2025-05-07T20:32:54.9412894Z @given( 2025-05-07T20:32:54.9413010Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9413104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9413216Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9413328Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9413439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9413509Z ) 2025-05-07T20:32:54.9413852Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9413945Z def test_silu_mul_quant( 2025-05-07T20:32:54.9414085Z self, 2025-05-07T20:32:54.9414160Z T: int, 2025-05-07T20:32:54.9414236Z D: int, 2025-05-07T20:32:54.9414333Z scale_ub: Optional[float], 2025-05-07T20:32:54.9414418Z contiguous: bool, 2025-05-07T20:32:54.9414504Z compiled: bool, 2025-05-07T20:32:54.9414580Z ) -> None: 2025-05-07T20:32:54.9414671Z torch.manual_seed(2025) 2025-05-07T20:32:54.9414745Z 2025-05-07T20:32:54.9414907Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9416627Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9416635Z 2025-05-07T20:32:54.9416753Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9416757Z 2025-05-07T20:32:54.9416861Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9417076Z self=, 2025-05-07T20:32:54.9417193Z T=4096, 2025-05-07T20:32:54.9417268Z D=7168, 2025-05-07T20:32:54.9417347Z scale_ub=1200.0, 2025-05-07T20:32:54.9417428Z contiguous=True, 2025-05-07T20:32:54.9417510Z compiled=False, 2025-05-07T20:32:54.9417579Z ) 2025-05-07T20:32:54.9417789Z self = 2025-05-07T20:32:54.9417956Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.9417960Z 2025-05-07T20:32:54.9418032Z @given( 2025-05-07T20:32:54.9418147Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9418245Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9418355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9418514Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9418627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9418698Z ) 2025-05-07T20:32:54.9418941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9419034Z def test_silu_mul_quant( 2025-05-07T20:32:54.9419107Z self, 2025-05-07T20:32:54.9419185Z T: int, 2025-05-07T20:32:54.9419259Z D: int, 2025-05-07T20:32:54.9419357Z scale_ub: Optional[float], 2025-05-07T20:32:54.9419442Z contiguous: bool, 2025-05-07T20:32:54.9419524Z compiled: bool, 2025-05-07T20:32:54.9419602Z ) -> None: 2025-05-07T20:32:54.9419691Z torch.manual_seed(2025) 2025-05-07T20:32:54.9419764Z 2025-05-07T20:32:54.9419925Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9421690Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9421698Z 2025-05-07T20:32:54.9421815Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9421820Z 2025-05-07T20:32:54.9421918Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9422132Z self=, 2025-05-07T20:32:54.9422210Z T=16384, 2025-05-07T20:32:54.9422323Z D=7168, 2025-05-07T20:32:54.9422405Z scale_ub=None, 2025-05-07T20:32:54.9422487Z contiguous=False, 2025-05-07T20:32:54.9422568Z compiled=True, 2025-05-07T20:32:54.9422646Z ) 2025-05-07T20:32:54.9422858Z self = 2025-05-07T20:32:54.9423026Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.9423033Z 2025-05-07T20:32:54.9423108Z @given( 2025-05-07T20:32:54.9423220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9423314Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9423426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9423538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9423654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9423725Z ) 2025-05-07T20:32:54.9423960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9424055Z def test_silu_mul_quant( 2025-05-07T20:32:54.9424128Z self, 2025-05-07T20:32:54.9424203Z T: int, 2025-05-07T20:32:54.9424282Z D: int, 2025-05-07T20:32:54.9424379Z scale_ub: Optional[float], 2025-05-07T20:32:54.9424464Z contiguous: bool, 2025-05-07T20:32:54.9424550Z compiled: bool, 2025-05-07T20:32:54.9424625Z ) -> None: 2025-05-07T20:32:54.9424760Z torch.manual_seed(2025) 2025-05-07T20:32:54.9424832Z 2025-05-07T20:32:54.9424991Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9426755Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9426763Z 2025-05-07T20:32:54.9426879Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9426883Z 2025-05-07T20:32:54.9426984Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9427200Z self=, 2025-05-07T20:32:54.9427274Z T=4096, 2025-05-07T20:32:54.9427353Z D=7168, 2025-05-07T20:32:54.9427433Z scale_ub=None, 2025-05-07T20:32:54.9427514Z contiguous=True, 2025-05-07T20:32:54.9427602Z compiled=False, 2025-05-07T20:32:54.9427671Z ) 2025-05-07T20:32:54.9427879Z self = 2025-05-07T20:32:54.9428047Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.9428052Z 2025-05-07T20:32:54.9428124Z @given( 2025-05-07T20:32:54.9428245Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9428339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9428451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9428606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9428716Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9428789Z ) 2025-05-07T20:32:54.9429029Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9429118Z def test_silu_mul_quant( 2025-05-07T20:32:54.9429191Z self, 2025-05-07T20:32:54.9429266Z T: int, 2025-05-07T20:32:54.9429339Z D: int, 2025-05-07T20:32:54.9429436Z scale_ub: Optional[float], 2025-05-07T20:32:54.9429525Z contiguous: bool, 2025-05-07T20:32:54.9429606Z compiled: bool, 2025-05-07T20:32:54.9429682Z ) -> None: 2025-05-07T20:32:54.9429771Z torch.manual_seed(2025) 2025-05-07T20:32:54.9429884Z 2025-05-07T20:32:54.9430046Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9431765Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9431773Z 2025-05-07T20:32:54.9431887Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9431891Z 2025-05-07T20:32:54.9431989Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9432202Z self=, 2025-05-07T20:32:54.9432281Z T=16384, 2025-05-07T20:32:54.9432355Z D=7168, 2025-05-07T20:32:54.9432435Z scale_ub=None, 2025-05-07T20:32:54.9432517Z contiguous=True, 2025-05-07T20:32:54.9432598Z compiled=False, 2025-05-07T20:32:54.9432674Z ) 2025-05-07T20:32:54.9432886Z self = 2025-05-07T20:32:54.9433095Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.9433099Z 2025-05-07T20:32:54.9433174Z @given( 2025-05-07T20:32:54.9433286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9433380Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9433497Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9433608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9433718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9433787Z ) 2025-05-07T20:32:54.9434022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9434118Z def test_silu_mul_quant( 2025-05-07T20:32:54.9434231Z self, 2025-05-07T20:32:54.9434306Z T: int, 2025-05-07T20:32:54.9434386Z D: int, 2025-05-07T20:32:54.9434480Z scale_ub: Optional[float], 2025-05-07T20:32:54.9434566Z contiguous: bool, 2025-05-07T20:32:54.9434652Z compiled: bool, 2025-05-07T20:32:54.9434727Z ) -> None: 2025-05-07T20:32:54.9434817Z torch.manual_seed(2025) 2025-05-07T20:32:54.9434890Z 2025-05-07T20:32:54.9435048Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9436810Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.9436819Z 
2025-05-07T20:32:54.9436933Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:54.9436937Z 
2025-05-07T20:32:54.9437040Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.9437255Z     self=,
2025-05-07T20:32:54.9437328Z     T=16384,
2025-05-07T20:32:54.9437402Z     D=7168,
2025-05-07T20:32:54.9437480Z     scale_ub=1200.0,
2025-05-07T20:32:54.9437563Z     contiguous=True,
2025-05-07T20:32:54.9437647Z     compiled=False,
2025-05-07T20:32:54.9437716Z )
2025-05-07T20:32:54.9437927Z self = 
2025-05-07T20:32:54.9438097Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:54.9438165Z 
2025-05-07T20:32:54.9438238Z @given(
2025-05-07T20:32:54.9438357Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.9438452Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.9438564Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.9438679Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.9438790Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.9438860Z )
2025-05-07T20:32:54.9439097Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.9439185Z def test_silu_mul_quant(
2025-05-07T20:32:54.9439258Z     self,
2025-05-07T20:32:54.9439335Z     T: int,
2025-05-07T20:32:54.9439408Z     D: int,
2025-05-07T20:32:54.9439505Z     scale_ub: Optional[float],
2025-05-07T20:32:54.9439590Z     contiguous: bool,
2025-05-07T20:32:54.9439671Z     compiled: bool,
2025-05-07T20:32:54.9439750Z ) -> None:
2025-05-07T20:32:54.9439840Z     torch.manual_seed(2025)
2025-05-07T20:32:54.9439909Z 
2025-05-07T20:32:54.9440076Z >   x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.9441798Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.9441849Z 
2025-05-07T20:32:54.9441966Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:54.9441970Z 
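[NOTE] The failed request sizes track the example parameters exactly: the test allocates x with shape [T, 2 * D] in bfloat16 (2 bytes per element), so T=16384, D=7168 is 16384 * 14336 * 2 bytes = 448 MiB, matching the "Tried to allocate 448.00 MiB" above, and T=2048, D=5120 gives the earlier 40 MiB. The allocations are individually modest; the device is simply already full. A quick check:

def randn_alloc_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16) -> 2 bytes/element.
    return T * 2 * D * bytes_per_elem / (1024 ** 2)

assert randn_alloc_mib(16384, 7168) == 448.0  # matches the failure above
assert randn_alloc_mib(2048, 5120) == 40.0    # matches the earlier failure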
2025-05-07T20:32:54.9442069Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.9442286Z     self=,
2025-05-07T20:32:54.9442361Z     T=128,
2025-05-07T20:32:54.9442473Z     D=5120,
2025-05-07T20:32:54.9442555Z     scale_ub=1200.0,
2025-05-07T20:32:54.9442640Z     contiguous=False,
2025-05-07T20:32:54.9442722Z     compiled=False,
2025-05-07T20:32:54.9442794Z )
2025-05-07T20:32:54.9443003Z self = 
2025-05-07T20:32:54.9443170Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:54.9443174Z 
2025-05-07T20:32:54.9443249Z @given(
2025-05-07T20:32:54.9443361Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.9443456Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.9443571Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.9443683Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.9443793Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.9443867Z )
2025-05-07T20:32:54.9444101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.9444196Z def test_silu_mul_quant(
2025-05-07T20:32:54.9444311Z     self,
2025-05-07T20:32:54.9444386Z     T: int,
2025-05-07T20:32:54.9444462Z     D: int,
2025-05-07T20:32:54.9444557Z     scale_ub: Optional[float],
2025-05-07T20:32:54.9444644Z     contiguous: bool,
2025-05-07T20:32:54.9444729Z     compiled: bool,
2025-05-07T20:32:54.9444805Z ) -> None:
2025-05-07T20:32:54.9444895Z     torch.manual_seed(2025)
2025-05-07T20:32:54.9444967Z 
2025-05-07T20:32:54.9445126Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.9445199Z 
2025-05-07T20:32:54.9445289Z     x_sign = torch.sign(x)
2025-05-07T20:32:54.9445409Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:54.9445497Z     x = x_sign * x_clamp
2025-05-07T20:32:54.9445616Z     x0 = x[:, :D]
2025-05-07T20:32:54.9445694Z     x1 = x[:, D:]
2025-05-07T20:32:54.9445768Z 
2025-05-07T20:32:54.9445848Z     if contiguous:
2025-05-07T20:32:54.9445938Z         x0 = x0.contiguous()
2025-05-07T20:32:54.9446031Z         x1 = x1.contiguous()
2025-05-07T20:32:54.9446100Z 
2025-05-07T20:32:54.9446186Z     if scale_ub is not None:
2025-05-07T20:32:54.9446299Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:54.9446432Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:54.9446504Z         )
2025-05-07T20:32:54.9446576Z     else:
2025-05-07T20:32:54.9446668Z         scale_ub_tensor = None
2025-05-07T20:32:54.9446737Z 
2025-05-07T20:32:54.9446860Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.9446948Z         op = silu_mul_quant
2025-05-07T20:32:54.9447028Z         if compiled:
2025-05-07T20:32:54.9447127Z             op = torch.compile(op)
2025-05-07T20:32:54.9447231Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.9447300Z 
2025-05-07T20:32:54.9447389Z >   y_fp8, y_scale = fn()
2025-05-07T20:32:54.9447396Z 
2025-05-07T20:32:54.9447489Z moe/activation_test.py:117: 
2025-05-07T20:32:54.9447620Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.9447720Z moe/activation_test.py:115: in fn
2025-05-07T20:32:54.9447861Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.9448354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:54.9448451Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9448804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9449024Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9449358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9449451Z kernel = self.compile( 2025-05-07T20:32:54.9449873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9450043Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9450178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9450183Z 2025-05-07T20:32:54.9450380Z self = 2025-05-07T20:32:54.9451142Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9451638Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe5b9e951c0>} 2025-05-07T20:32:54.9452410Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9452600Z context = 2025-05-07T20:32:54.9452607Z 2025-05-07T20:32:54.9452770Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9453024Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9453132Z module_map=module_map) 2025-05-07T20:32:54.9453289Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9453387Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9453462Z E ^ 2025-05-07T20:32:54.9453900Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9453946Z 2025-05-07T20:32:54.9454356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9454364Z 2025-05-07T20:32:54.9454463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9454682Z self=, 2025-05-07T20:32:54.9454758Z T=2048, 2025-05-07T20:32:54.9454830Z D=7168, 2025-05-07T20:32:54.9454912Z scale_ub=None, 2025-05-07T20:32:54.9454995Z contiguous=False, 2025-05-07T20:32:54.9455076Z compiled=False, 2025-05-07T20:32:54.9455149Z ) 2025-05-07T20:32:54.9455360Z self = 2025-05-07T20:32:54.9455526Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.9455531Z 2025-05-07T20:32:54.9455607Z @given( 2025-05-07T20:32:54.9455725Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9455826Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9455945Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9456061Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9456174Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9456245Z ) 2025-05-07T20:32:54.9456526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9456619Z def test_silu_mul_quant( 2025-05-07T20:32:54.9456692Z self, 2025-05-07T20:32:54.9456767Z T: int, 2025-05-07T20:32:54.9456846Z D: int, 2025-05-07T20:32:54.9456941Z scale_ub: Optional[float], 2025-05-07T20:32:54.9457029Z contiguous: bool, 2025-05-07T20:32:54.9457114Z compiled: bool, 2025-05-07T20:32:54.9457189Z ) -> None: 2025-05-07T20:32:54.9457282Z torch.manual_seed(2025) 2025-05-07T20:32:54.9457352Z 2025-05-07T20:32:54.9457517Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9459293Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9459302Z 2025-05-07T20:32:54.9459418Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9459422Z 2025-05-07T20:32:54.9459524Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9459740Z self=, 2025-05-07T20:32:54.9459817Z T=128, 2025-05-07T20:32:54.9459893Z D=7168, 2025-05-07T20:32:54.9459972Z scale_ub=1200.0, 2025-05-07T20:32:54.9460054Z contiguous=True, 2025-05-07T20:32:54.9460137Z compiled=True, 2025-05-07T20:32:54.9460206Z ) 2025-05-07T20:32:54.9460458Z self = 2025-05-07T20:32:54.9460617Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.9460625Z 2025-05-07T20:32:54.9460698Z @given( 2025-05-07T20:32:54.9460817Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9460911Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9461021Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9461135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9461244Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9461314Z ) 2025-05-07T20:32:54.9461555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9461685Z def test_silu_mul_quant( 2025-05-07T20:32:54.9461762Z self, 2025-05-07T20:32:54.9461840Z T: int, 2025-05-07T20:32:54.9461913Z D: int, 2025-05-07T20:32:54.9462014Z scale_ub: Optional[float], 2025-05-07T20:32:54.9462099Z contiguous: bool, 2025-05-07T20:32:54.9462181Z compiled: bool, 2025-05-07T20:32:54.9462261Z ) -> None: 2025-05-07T20:32:54.9462350Z torch.manual_seed(2025) 2025-05-07T20:32:54.9462419Z 2025-05-07T20:32:54.9462584Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9462654Z 2025-05-07T20:32:54.9462743Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9462868Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9462953Z x = x_sign * x_clamp 2025-05-07T20:32:54.9463034Z x0 = x[:, :D] 2025-05-07T20:32:54.9463112Z x1 = x[:, D:] 2025-05-07T20:32:54.9463180Z 2025-05-07T20:32:54.9463266Z if contiguous: 2025-05-07T20:32:54.9463353Z x0 = x0.contiguous() 2025-05-07T20:32:54.9463441Z x1 = x1.contiguous() 2025-05-07T20:32:54.9463516Z 2025-05-07T20:32:54.9463606Z if scale_ub is not None: 2025-05-07T20:32:54.9463707Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9463841Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9463958Z ) 2025-05-07T20:32:54.9464034Z else: 2025-05-07T20:32:54.9464128Z scale_ub_tensor = None 2025-05-07T20:32:54.9464197Z 2025-05-07T20:32:54.9464320Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9464410Z op = silu_mul_quant 2025-05-07T20:32:54.9464492Z if compiled: 2025-05-07T20:32:54.9464590Z op = torch.compile(op) 2025-05-07T20:32:54.9464690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9464759Z 2025-05-07T20:32:54.9464847Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9464855Z 2025-05-07T20:32:54.9464948Z moe/activation_test.py:117: 2025-05-07T20:32:54.9465114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9465219Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9465314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9465675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9465769Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9466252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9466349Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9466701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9466915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9467254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9467346Z kernel = self.compile( 2025-05-07T20:32:54.9467791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9467964Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9468090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9468094Z 2025-05-07T20:32:54.9468296Z self = 2025-05-07T20:32:54.9469057Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9469554Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe5b9de7b00>} 2025-05-07T20:32:54.9470333Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9470523Z context = 2025-05-07T20:32:54.9470532Z 2025-05-07T20:32:54.9470692Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9470945Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9471052Z module_map=module_map) 2025-05-07T20:32:54.9471210Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9471304Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9471381Z E ^ 2025-05-07T20:32:54.9471729Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9471734Z 2025-05-07T20:32:54.9472143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9472147Z 2025-05-07T20:32:54.9472247Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9472504Z self=, 2025-05-07T20:32:54.9472582Z T=128, 2025-05-07T20:32:54.9472656Z D=7168, 2025-05-07T20:32:54.9472736Z scale_ub=1200.0, 2025-05-07T20:32:54.9472820Z contiguous=True, 2025-05-07T20:32:54.9472899Z compiled=False, 2025-05-07T20:32:54.9472970Z ) 2025-05-07T20:32:54.9473186Z self = 2025-05-07T20:32:54.9473346Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.9473351Z 2025-05-07T20:32:54.9473429Z @given( 2025-05-07T20:32:54.9473543Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9473677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9473794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9473906Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9474014Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9474090Z ) 2025-05-07T20:32:54.9474327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9474416Z def test_silu_mul_quant( 2025-05-07T20:32:54.9474492Z self, 2025-05-07T20:32:54.9474566Z T: int, 2025-05-07T20:32:54.9474641Z D: int, 2025-05-07T20:32:54.9474735Z scale_ub: Optional[float], 2025-05-07T20:32:54.9474821Z contiguous: bool, 2025-05-07T20:32:54.9474905Z compiled: bool, 2025-05-07T20:32:54.9474980Z ) -> None: 2025-05-07T20:32:54.9475071Z torch.manual_seed(2025) 2025-05-07T20:32:54.9475147Z 2025-05-07T20:32:54.9475305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9475377Z 2025-05-07T20:32:54.9475510Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9475632Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9477358Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9477367Z 2025-05-07T20:32:54.9477481Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.9477523Z 2025-05-07T20:32:54.9477624Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9477841Z self=, 2025-05-07T20:32:54.9477917Z T=128, 2025-05-07T20:32:54.9477995Z D=5120, 2025-05-07T20:32:54.9478075Z scale_ub=1200.0, 2025-05-07T20:32:54.9478155Z contiguous=True, 2025-05-07T20:32:54.9478241Z compiled=True, 2025-05-07T20:32:54.9478311Z ) 2025-05-07T20:32:54.9478520Z self = 2025-05-07T20:32:54.9478681Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.9478685Z 2025-05-07T20:32:54.9478757Z @given( 2025-05-07T20:32:54.9478875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9478970Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9479083Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9479201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9479309Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9479380Z ) 2025-05-07T20:32:54.9479622Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9479712Z def test_silu_mul_quant( 2025-05-07T20:32:54.9479786Z self, 2025-05-07T20:32:54.9479910Z T: int, 2025-05-07T20:32:54.9479984Z D: int, 2025-05-07T20:32:54.9480080Z scale_ub: Optional[float], 2025-05-07T20:32:54.9480167Z contiguous: bool, 2025-05-07T20:32:54.9480249Z compiled: bool, 2025-05-07T20:32:54.9480326Z ) -> None: 2025-05-07T20:32:54.9480418Z torch.manual_seed(2025) 2025-05-07T20:32:54.9480488Z 2025-05-07T20:32:54.9480650Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9480719Z 2025-05-07T20:32:54.9480806Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9480929Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9482687Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.9482697Z 
2025-05-07T20:32:54.9482813Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:54.9482817Z 
2025-05-07T20:32:54.9482914Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.9483128Z     self=,
2025-05-07T20:32:54.9483204Z     T=128,
2025-05-07T20:32:54.9483277Z     D=7168,
2025-05-07T20:32:54.9483362Z     scale_ub=None,
2025-05-07T20:32:54.9483443Z     contiguous=True,
2025-05-07T20:32:54.9483522Z     compiled=True,
2025-05-07T20:32:54.9483596Z )
2025-05-07T20:32:54.9483842Z self = 
2025-05-07T20:32:54.9484002Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:54.9484009Z 
2025-05-07T20:32:54.9484087Z @given(
2025-05-07T20:32:54.9484200Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.9484294Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.9484407Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.9484519Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.9484630Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.9484700Z )
2025-05-07T20:32:54.9484937Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.9485071Z def test_silu_mul_quant(
2025-05-07T20:32:54.9485145Z     self,
2025-05-07T20:32:54.9485218Z     T: int,
2025-05-07T20:32:54.9485294Z     D: int,
2025-05-07T20:32:54.9485392Z     scale_ub: Optional[float],
2025-05-07T20:32:54.9485479Z     contiguous: bool,
2025-05-07T20:32:54.9485565Z     compiled: bool,
2025-05-07T20:32:54.9485639Z ) -> None:
2025-05-07T20:32:54.9485734Z     torch.manual_seed(2025)
2025-05-07T20:32:54.9485806Z 
2025-05-07T20:32:54.9485965Z >   x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.9487684Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.9487695Z 
2025-05-07T20:32:54.9487810Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:54.9487944Z =============================== warnings summary ===============================
2025-05-07T20:32:54.9488290Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:54.9488585Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:54.9488882Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:54.9489738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
2025-05-07T20:32:54.9489967Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
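[NOTE] The three collected warnings are unrelated to the failures: Triton deprecated the warmup, rep, and use_cuda_graph autotuner arguments (triton-lang/triton#4496), and something on the import path still passes them. A sketch of an autotuned kernel declared without the deprecated arguments; the kernel, block sizes, and key are illustrative only:

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
    ],
    key=["n"],  # re-tune when n changes; no warmup/rep/use_cuda_graph here
)
@triton.jit
def _double_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    # Elementwise y = 2 * x over n elements, one block per program.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)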
See " 2025-05-07T20:32:54.9490011Z 2025-05-07T20:32:54.9490217Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:54.9490379Z ================= 1 failed, 1 deselected, 3 warnings in 11.97s ================= 2025-05-07T20:32:56.5825320Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:56.6460795Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:32:56.6461036Z 2025-05-07T20:32:56.6461215Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:32:56.6461779Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:32:56.6462168Z 2025-05-07T20:32:56.6462196Z 2025-05-07T20:32:56.6462200Z 2025-05-07T20:32:56.6478554Z ##[error]Process completed with exit code 1. 2025-05-07T20:32:56.6564551Z Post job cleanup. 2025-05-07T20:32:56.7552465Z [command]/usr/bin/git version 2025-05-07T20:32:56.7595562Z git version 2.47.1 2025-05-07T20:32:56.7635261Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/e14a5568-deff-401d-b484-86b49e6546de/.gitconfig' 2025-05-07T20:32:56.7646407Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/e14a5568-deff-401d-b484-86b49e6546de' before making global git config changes 2025-05-07T20:32:56.7647247Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:32:56.7660952Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:32:56.7705556Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:32:56.7741060Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:32:56.8082602Z Entering 'external/asmjit' 2025-05-07T20:32:56.8149927Z Entering 'external/composable_kernel' 2025-05-07T20:32:56.8223351Z Entering 'external/cpuinfo' 2025-05-07T20:32:56.8292555Z Entering 'external/cutlass' 2025-05-07T20:32:56.8368889Z Entering 'external/googletest' 2025-05-07T20:32:56.8436192Z Entering 'external/hipify_torch' 2025-05-07T20:32:56.8503537Z Entering 'external/json' 2025-05-07T20:32:56.8594277Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:32:56.8620639Z http.https://github.com/.extraheader 2025-05-07T20:32:56.8632328Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:32:56.8665772Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:32:56.8995337Z Entering 'external/asmjit' 2025-05-07T20:32:56.9041570Z http.https://github.com/.extraheader 2025-05-07T20:32:56.9084733Z Entering 'external/composable_kernel' 2025-05-07T20:32:56.9131158Z http.https://github.com/.extraheader 2025-05-07T20:32:56.9180691Z Entering 'external/cpuinfo' 2025-05-07T20:32:56.9223543Z http.https://github.com/.extraheader 2025-05-07T20:32:56.9268039Z Entering 'external/cutlass' 2025-05-07T20:32:56.9313840Z http.https://github.com/.extraheader 2025-05-07T20:32:56.9364906Z 
2025-05-07T20:32:56.6564551Z Post job cleanup.
2025-05-07T20:32:56.7552465Z [command]/usr/bin/git version
2025-05-07T20:32:56.7595562Z git version 2.47.1
2025-05-07T20:32:56.7635261Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/e14a5568-deff-401d-b484-86b49e6546de/.gitconfig'
2025-05-07T20:32:56.7646407Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/e14a5568-deff-401d-b484-86b49e6546de' before making global git config changes
2025-05-07T20:32:56.7647247Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:32:56.7660952Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:32:56.7705556Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:32:56.7741060Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:32:56.8082602Z Entering 'external/asmjit'
2025-05-07T20:32:56.8149927Z Entering 'external/composable_kernel'
2025-05-07T20:32:56.8223351Z Entering 'external/cpuinfo'
2025-05-07T20:32:56.8292555Z Entering 'external/cutlass'
2025-05-07T20:32:56.8368889Z Entering 'external/googletest'
2025-05-07T20:32:56.8436192Z Entering 'external/hipify_torch'
2025-05-07T20:32:56.8503537Z Entering 'external/json'
2025-05-07T20:32:56.8594277Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:32:56.8620639Z http.https://github.com/.extraheader
2025-05-07T20:32:56.8632328Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
2025-05-07T20:32:56.8665772Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:32:56.8995337Z Entering 'external/asmjit'
2025-05-07T20:32:56.9041570Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9084733Z Entering 'external/composable_kernel'
2025-05-07T20:32:56.9131158Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9180691Z Entering 'external/cpuinfo'
2025-05-07T20:32:56.9223543Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9268039Z Entering 'external/cutlass'
2025-05-07T20:32:56.9313840Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9364906Z Entering 'external/googletest'
2025-05-07T20:32:56.9407150Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9449748Z Entering 'external/hipify_torch'
2025-05-07T20:32:56.9493071Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9536136Z Entering 'external/json'
2025-05-07T20:32:56.9579294Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9729851Z A job completed hook has been configured by the self-hosted runner administrator
2025-05-07T20:32:56.9757674Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh'
2025-05-07T20:32:56.9768342Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:32:56.9768700Z ##[endgroup]
2025-05-07T20:32:56.9871866Z [!ALERT!] Swap in detected! [!ALERT!]
2025-05-07T20:33:07.7867180Z [!ALERT!] Swap out detected [!ALERT!]
2025-05-07T20:33:24.3653060Z Cleaning up orphan processes
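[NOTE] The after_job.sh swap alerts say the host itself came under memory pressure during the job, a separate signal from the GPU OOM above. A small Linux-only snippet for capturing the host-side picture when these alerts fire; it reads /proc/meminfo directly and needs no extra dependencies:

def meminfo(keys=("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")):
    # Parse /proc/meminfo into a {key: "value kB"} dict.
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key.strip() in keys:
                out[key.strip()] = value.strip()
    return out

print(meminfo())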